Description of problem: For as-yet undetermined reasons, gam_server sometimes starts using lots of CPU time (~80% on my Athlon XP 1466 MHz). Version-Release number of selected component (if applicable): gamin-0.0.9-1, everything else current rawhide. How reproducible: Not sure what triggers it; I think this is the second time I've seen it, but I didn't investigate the first time as I needed to boot into a new kernel then anyway (yes, I know, I'm a bad tester!) Additional info: attaching strace to the process yields the same two lines (I verified with sort/uniq) over and over in an infinitely repeating loop: stat64("/home/wes/.kde/share/config/ksmserverrc", {st_dev=makedev(9, 0), st_ino=99361, st_mode=S_IFREG|0600, st_nlink=1, st_uid=500, st_gid=500, st_blksize=4096, st_blocks=8, st_size=1492, st_atime=2004/09/09-17:55:06, st_mtime=2004/09/09-16:20:20, st_ctime=2004/09/09-16:20:20}) = 0 stat64("/home/wes/.kde/share/config/kioslaverc", {st_dev=makedev(9, 0), st_ino=163182, st_mode=S_IFREG|0600, st_nlink=1, st_uid=500, st_gid=500, st_blksize=4096, st_blocks=8, st_size=92, st_atime=2004/09/11-01:37:47, st_mtime=2004/08/25-23:49:21, st_ctime=2004/08/25-23:49:21}) = 0 I then sent it a SIGHUP and it went back to normal. Could be coincidence; does gam_server actually restart on that signal? (I stupidly didn't think to leave strace attached when I did it.) I will set up gam_server to run with GAM_DEBUG and --notimeout and all that; hopefully I can catch the behavior again.
Ok, I'm seeing it again. Unfortunately after two days of trying to catch it in debug mode, I had rebooted the system for updates and forgot to put gam_server in debug mode again :( This time it's looping against the following two (different than before) files: stat64("/home/wes/.kde/share/config/knotify.eventsrc", {st_dev=makedev(9, 0), st_ino=7314, st_mode=S_IFREG|0600, st_nlink=1, st_uid=500, st_gid=500, st_blksize=4096, st_blocks=8, st_size=1085, st_atime=2004/09/15-19:41:19, st_mtime=2004/08/29-20:53:09, st_ctime=2004/08/29-20:53:09}) = 0 stat64("/home/wes/.kde/share/config/kpilot_vcalconduitsrc", {st_dev=makedev(9, 0), st_ino=163235, st_mode=S_IFREG|0600, st_nlink=1, st_uid=500, st_gid=500, st_blksize=4096, st_blocks=8, st_size=57, st_atime=2004/08/25-23:49:33, st_mtime=2004/08/25-23:49:33, st_ctime=2004/08/25-23:49:33}) = 0 I see there is a new glibc in rawhide today, so I'm going to install it and hope it helps.
I am watching gam_server on rawhide 9-23-2004 system and it is using up around 50% of the CPU!
Excellent. Launch gdb /usr/libexec/gam_server, attach with the PID of the process, look at what's happening and report. Knowing only that you are seeing it, or having a syscall trace, isn't that useful! Daniel
I'm seeing it here as well. At most, gam_server is eating as much as 50-70% of the cpu.
http://www.gnome.org/~veillard/gamin/debug.html#Debugging1 debug the problem and provide a trace. Also make sure you have the latest version installed. What I said in comment #3 is still valid. Daniel
Ok, caught it again, this time on gamin-0.0.14-1 (which is current rawhide AFAIK) Using your fancy new SIGUSR2 debug trick, I get a quickly growing file with this line repeated forever: node_remove_subscription() It's nice that another SIGUSR2 turns it off again, because it was threatening to fill my disk :O Interestingly, now strace shows nothing, nada, nichts, rien. (well it shows the debug prints if that's enabled). Different problem causing the same high CPU usage, or just difference due to code changes you've made? Latest gdb backtrace, this time with debuginfo installed: #0 0x00135e42 in __i686.get_pc_thunk.bx () from /usr/lib/libglib-2.0.so.0 #1 0x00155944 in g_node_is_ancestor (node=0x8123018, descendant=0x8058498) at gnode.c:413 #2 0x0804af3a in gam_tree_remove (tree=0x80583c8, node=0x8123018) at gam_tree.c:144 #3 0x0804b7d3 in remove_directory_subscription (node=0x8123018, sub=0x811c4e8) at gam_poll.c:507 #4 0x0804cd56 in gam_poll_consume_subscriptions () at gam_poll.c:918 #5 0x0804fc64 in gam_dnotify_consume_subscriptions_real (data=0x0) at gam_dnotify.c:212 #6 0x0014e848 in g_idle_dispatch (source=0x8129f00, callback=0x8123018, user_data=0x8058498) at gmain.c:3802 #7 0x0014b4fb in g_main_context_dispatch (context=0x8057ee8) at gmain.c:1942 #8 0x0014cf82 in g_main_context_iterate (context=0x8057ee8, block=1, dispatch=1, self=0x8053018) at gmain.c:2573 #9 0x0014d22f in g_main_loop_run (loop=0x8059908) at gmain.c:2777 #10 0x0804aa28 in main (argc=1, argv=0xfefffa54) at gam_server.c:330 #11 0x001b7b03 in __libc_start_main (main=0x804a8f7 <main>, argc=1, ubp_av=0xfefffa54, init=0x8050304 <__libc_csu_init>, fini=0xfefff9e0, rtld_fini=0xfefffa54, stack_end=0xfefffa4c) at ../sysdeps/generic/libc-start.c:209 #12 0x08049fa1 in _start () Stepping through it in ddd/gdb, I notice that in gam_tree_remove, the g_node_is_ancestor sanity check seems to be consistently failing. 
To be specific, in g_node_is_ancestor, descendant->parent seems to always be null (only data and next are non-null). Somewhere the trees aren't getting built right, or are being systematically corrupted... I'm not resetting gam_server for the moment; email me if you want to telnet in and gdb it or X ddd out to your host to check it out, since it seems to be difficult to reproduce...
I'm about to go on the road... this helps, but you should not wait for me. thanks, Daniel
I had the same problem again too... not clear what triggers it. I was installing the new kernel rpm and it was taking forever to install. When I ran top I saw gamin taking all the CPU. This has the potential to cause serious hangs.
I also got this problem using Fedora Core 3 test 3 on a ProLiant DL 145 (AMD64, Opteron) using x86_64, but here gam_server constantly uses 99.9% CPU. This causes a baseline load of ~1.5, which is very bad. NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 3440 root 25 0 6204 2268 4836 R 99.9 0.0 18:35.44 gam_server [root@fc3-test ~]# ps aux | grep gam root 3440 63.2 0.0 6204 2268 ? R 13:40 21:12 /usr/libexec/gam_server root 5081 0.0 0.0 42308 760 pts/3 S+ 14:14 0:00 grep gam [root@fc3-test ~]# [root@fc3-test ~]# rpm -q gamin --qf '%{name}-%{version}-%{release}.%{arch}\n' gamin-0.0.14-1.i386 gamin-0.0.14-1.x86_64 [root@fc3-test ~]# This problem really should be solved before the final release. I would also suggest marking this issue as a possible blocker.
gamin-0.0.14-1, current rawhide, x86_64 dual opteron. Output from SIGUSR2 shows it's looping through the following list of files. Looks like a poll_file loop is stuck. Quick source read points at gam_poll_scan_directory_internal and the for loop: for (l = children; l; l = l->next) { Poll: poll_file for /usr/share/applications/gnome-accessibility.desktop called at 1097770113 delta 0 : 0 Poll: poll_file /usr/share/applications/gnome-accessibility.desktop unchanged 1097700841 0 : 1097700841 0 Poll: poll_file for /usr/share/applications/redhat-neat-control.desktop called at 1097770113 delta 0 : 0 Poll: poll_file /usr/share/applications/redhat-neat-control.desktop unchanged 1096989180 0 : 1096989180 0 Poll: poll_file for /usr/share/applications/redhat-rhn-up2date-config.desktop called at 1097770113 delta 0 : 0 Poll: poll_file /usr/share/applications/redhat-rhn-up2date-config.desktop unchanged 1095979273 0 : 1095979273 0 And gdb confirms this: (gdb) bt #0 0x0000002a95721945 in ?? () #1 0x0000000000404701 in poll_file (node=0x523b40) at stat.h:366 #2 0x0000000000404b46 in gam_poll_scan_directory_internal (dir_node=0x0, exist_subs=0x0, scan_for_new=1) at gam_poll.c:446 #3 0x0000000000404f33 in gam_poll_scan_callback (data=0x5303d0) at gam_poll.c:550 #4 0x00000036cc52942b in ?? 
() (gdb) list gam_poll.c:446 441 } 442 children = gam_tree_get_children(tree, dir_node); 443 for (l = children; l; l = l->next) { 444 node = (GamNode *) l->data; 445 446 fevent = poll_file(node); 447 448 if (gam_node_is_dir(node) && 449 gam_node_has_flag(node, FLAG_NEW_NODE) && 450 gam_node_get_subscriptions(node)) { (gdb) print l $1 = (GList *) 0x51ea60 (gdb) print l->next $2 = (GList *) 0x523920 (gdb) print l->next->next $3 = (GList *) 0x51ea78 (gdb) print l->next->next->next $4 = (GList *) 0x51ea60 (gdb) p children $5 = (GList *) 0x53ddd0 (gdb) p *(GamNode *)l->data $6 = {path = 0x523c10 "/usr/share/applications/redhat-neat-control.desktop", subs = 0x0, data = 0x530210, data_destroy = 0x404380 <gam_poll_data_destroy>, flags = 0, node = 0x515f78, is_dir = 0} This list is not NULL terminated.
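The gdb session above shows l->next->next->next pointing back at l, i.e. a three-element cycle. As a minimal sketch of how such a list could be checked programmatically, here is Floyd's two-pointer cycle test on a singly linked stand-in for GList. The struct node type and list_has_cycle() are hypothetical names for illustration and are not part of gamin or glib.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for GList; only the fields needed here. */
struct node {
    void *data;
    struct node *next;
};

/*
 * Floyd's cycle test: advance one pointer by one step and another by
 * two steps. If the list is NULL terminated the fast pointer falls off
 * the end; if the list loops, the two pointers eventually meet.
 * Returns 1 when a cycle is found, 0 otherwise.
 */
int list_has_cycle(const struct node *head)
{
    const struct node *slow = head;
    const struct node *fast = head;

    while (fast != NULL && fast->next != NULL) {
        slow = slow->next;
        fast = fast->next->next;
        if (slow == fast)
            return 1;
    }
    return 0;
}
```

The check needs no extra memory and touches each element a bounded number of times, so it would be cheap enough to run before walking a children list.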
Thanks a lot ! This is what I was afraid of. The gam_server is not multithreaded anymore, so such corruption should not be the result of unguarded reentrancy. The children list is obtained by children = gam_tree_get_children(tree, dir_node); which does GList *list = NULL; [...] for (i = 0; i < g_node_n_children(node); i++) { list = g_list_prepend(list, NODE_DATA(g_node_nth_child(node, i))); } gam_tree_get_children() cannot loop, it should return a correct list. the loop in gam_poll_scan_directory_internal() just emits event and should not modify the list which is built as a temporary structure, l or related list data are not passed down to the recursive call to gam_poll_scan_directory_internal() I'm puzzled that we end-up with some corruption there. Reading g_list_prepend() code I don't see how this could fail. Except running gam_server under valgrind to try to track a random memory access error I don't see how to chase this in a deterministic way. Annoying, very annoying ! Daniel
gam_server goes nuts here too with rawhide.
Yesterday I looked briefly at all the list manipulation areas, but didn't see anything glaring either. It doesn't seem like random memory access, however. The l->next pointers are valid, they just create a loop. I think the directory was being changed while this happened (during the daily rawhide update). Is there any async event which could change the tree? AFAICT, gam_tree_get_children() expects the parent node to remain quiescent. You said it's not multithreaded; how about signal driven? Is there any way for gam_tree_add() to happen during a gam_tree_get_children() so that the GNode sibling list changes while the GList is being built?
The only async event is the dnotify signal. It is handled by dnotify_signal_handler(), which pushes the file descriptor number onto a GQueue and does a write to a local pipe. The pipe is hooked to the mainloop, and purely synchronous processing should be done from there. There is a comment that changing the GQueue is not signal safe and that something else should be used. That's the only uncertainty I can detect in the code, assuming there is only the main application thread running. The fact that the problem seems to occur frequently on your fast SMP box makes me wonder if there isn't something which still generates some kind of reentrancy. What puzzles me is that even if the node children list was modified during gam_tree_get_children(), the list might get duplicate or wrong data pointers, but the l->next pointers should still be correct... Daniel
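To illustrate the signal-safety concern above, here is a minimal sketch of a handler that avoids the queue altogether and only calls write(), which is async-signal-safe, sending the descriptor number itself through the pipe. It assumes Linux, where a dnotify signal delivered with SA_SIGINFO carries the directory descriptor in si_fd. All names (wake_pipe, wake_pipe_init, dnotify_handler, wake_pipe_next_fd) are hypothetical and this is not gamin's actual code.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <unistd.h>

/* Hypothetical pipe: [0] is watched by the main loop, [1] is written
 * from the signal handler. */
static int wake_pipe[2] = { -1, -1 };

/* Create the pipe and make the write end non-blocking so the handler
 * can never stall inside write(). Returns 0 on success, -1 on error. */
int wake_pipe_init(void)
{
    if (pipe(wake_pipe) != 0)
        return -1;
    return fcntl(wake_pipe[1], F_SETFL, O_NONBLOCK) == -1 ? -1 : 0;
}

/*
 * Handler restricted to async-signal-safe work: it writes the
 * descriptor number into the pipe and restores errno. No malloc,
 * no queue manipulation, no stdio.
 */
void dnotify_handler(int signo, siginfo_t *info, void *context)
{
    int saved_errno = errno;
    int fd = info->si_fd;
    ssize_t n;

    (void) signo;
    (void) context;
    n = write(wake_pipe[1], &fd, sizeof fd);
    (void) n; /* if the pipe is full the event is dropped here */
    errno = saved_errno;
}

/* Called from the main loop once the read end is readable.
 * Returns the next descriptor number, or -1 on a short read. */
int wake_pipe_next_fd(void)
{
    int fd;

    if (read(wake_pipe[0], &fd, sizeof fd) != (ssize_t) sizeof fd)
        return -1;
    return fd;
}
```

With this shape all list and tree manipulation stays in the main loop, so the handler cannot interleave with it. A dropped event on a full pipe would have to be handled by rescanning, which this sketch does not attempt.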
I rechecked the whole code path for children = gam_tree_get_children(tree, dir_node); and how it is walked. I still can't understand how that data, which consists of local variables of the subroutines, could end up forming a loop or be modified to that effect. But to try to make progress I added some simple trick code in gam_poll_scan_directory_internal() that detects a loop in the children list, raises an error and breaks out of the loop if this happens. I released a 0.0.15 version with that workaround: http://www.gnome.org/~veillard/gamin/sources/ I would be very interested in feedback about this from those who had trouble with 0.0.14 looping in their environment. I don't consider the problem fixed though; it is a workaround until I fully understand the problem. Daniel
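The comment above does not say how the 0.0.15 workaround detects the loop. One possible shape, sketched here purely as an assumption, is to bound the walk by the number of children the directory is known to have and bail out once that bound is exceeded. The struct child type and scan_children_guarded() are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for one element of the children list. */
struct child {
    const char *path;
    struct child *next;
};

/*
 * Walk the list, calling visit() on each element, but never visit more
 * than `expected` elements. Returns the number of elements visited, or
 * -1 if the list turned out to be longer than expected, which is taken
 * as a sign of a loop.
 */
int scan_children_guarded(const struct child *children, int expected,
                          void (*visit)(const struct child *))
{
    const struct child *l;
    int seen = 0;

    for (l = children; l != NULL; l = l->next) {
        if (seen >= expected) {
            fprintf(stderr, "loop detected in children list\n");
            return -1;
        }
        if (visit != NULL)
            visit(l);
        seen++;
    }
    return seen;
}
```

A guard like this only stops the CPU spin; it does not explain how the list became cyclic in the first place.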
0.0.15 looped for me today. Since I did not have the -debuginfo package, and yum did not like me, I can not provide further information.
I generated gamin-0.0.16 after fixing a couple of problems including one in tree handling. I hammered on it seriously and could not reproduce any kind of problem with it. I would very much appreciate if the people who managed to get the looping effect could upgrade to 0.0.16 and report if they manage to reproduce the problem again: http://www.gnome.org/~veillard/gamin/sources/ thanks, Daniel
gamin 0.0.16 still loops
I can confirm this too but for the life of me have no idea what triggers it. I am running smp kernel with hyperthreading and using kde/kdm as gui.
*** Bug 137439 has been marked as a duplicate of this bug. ***
Created attachment 105898 [details] trace
Created attachment 105899 [details] debug
Comment on attachment 105899 [details] debug May contain sensitive information, please respect privacy.
Created attachment 105934 [details] another gdb trace
Well, you would need the debuginfo for the gdb trace. g_pattern_match is called indirectly from poll_file(), node_add_subscription() or node_remove_subscription(), and your log seems to indicate it is looping in node_remove_subscription(). This again seems to indicate a loop over a corrupted node list, this time a children list within a directory... Daniel
I've got the same problem with FC3 final. I was ripping CDs with grip, running KDevelop and listening to noatun when it happened. I'm running kernel 2.6.9-1.667smp. This is the first time I've seen it do this, and I've been watching processes quite closely because I had an issue with artsd going nuts. (Sound problem.) I just killed gam_server in top. I think there were 2 gam_servers running. I killed the first PID and a second one jumped to the top of the list briefly. It had a different PID. Let me know if there is anything I can do to help.
As with comment#26, I have not seen it previously (FC3-RC5) FC3 final, x86_64 (Sun W1100z), 2.6.9-1.667 (single CPU) Usage: KDE desktop (kontact, konqueror, kdevelop etc. K3b --ripping four sets of FC3 CDs). /usr/libexec/gam_server can eat up to 99% CPU When switching on debug (kill -s SIGUSR2 pid), I see this: # tail -f /tmp/gamin_debug_phCf Queue Full Queue Full Queue Full Queue Full Queue Full Queue Full Queue Full node_remove_subscription(â(*) I will watch it closely from now on.
Same thing here. FC3 final, i686_32, kernel-2.6.9-1.667 Using KDE desktop.
I noticed that since I updated to FC3 final, many of my email messages end up being duplicated (I am using kmail with maildir format mailboxes). Can it be related to problems with gamin ?
This happens to me if I run more than 1 kmail (say, on 2 different machines using imap). In this case, nothing to do with gamin.
I have the same setup. Could it be KDE related?
I do not run KDE, and I do see this very rarely (not at all during the last two (three?) weeks). But if I recall correctly, k3b (which is about the only kde program I use) liked to trigger it.
0.0.16 fixed it for me, thanks very much
I just released 0.0.17, where I have tried to cope with possible loops in the second place where gam_tree_get_children() is called, and also added more changes and checks in that function. http://www.gnome.org/~veillard/gamin/sources/ I would appreciate it if people having trouble could try that version and report! thanks, Daniel
Could you release RPMs in rawhide ? Thanks, Philippe
They are built and may show up within a day, Daniel
I am using GNOME in FC3 Final, and I notice that gam_server is using 99-100% of CPU after I just ran K3B in GNOME.
Try 0.0.17; see comment #34. Daniel
Just a quick "me too." FC3 release, fully updated as of this post. Running KDE, KMail, 2 instances of Konqueror as file manager, 2 idle command shells and Firefox 1.0. Most recent action: some file management stuff (moving them around). Also gedit. I'm not sure where to go to get "rpms in rawhide" (comment #35) but I'll look for it and install it if I find it.
> I'm not sure where to go to get "rpms in rawhide" http://fedora.redhat.com/download/updates.html This page explains the different stages of development and updates after a release of Fedora Core has gone out: - Fedora updates - Proposed Fedora (aka testing) - Development (aka rawhide) The 0.0.17 version of gamin is now in Fedora updates http://download.fedora.redhat.com/pub/fedora/linux/core/updates/3
After upgrading to 0.0.17, I no longer see big hikes in CPU usage like before. However, I just noticed that there has also been messages like this on my syslog for a while: # grep gam /var/log/messages Nov 16 20:10:45 foo kernel: gam_server[5241]: segfault at 0000000000000051 rip 00000000004038a7 rsp 0000007fbfffd3a8 error 4 Nov 17 07:16:11 foo kernel: gam_server[5844]: segfault at 0000000000000013 rip 00000000004038a7 rsp 0000007fbfffd3a8 error 4 Nov 17 08:02:32 foo kernel: gam_server[14902]: segfault at 000000000000000a rip 00000000004038a7 rsp 0000007fbfffd278 error 4 Nov 17 11:07:06 foo kernel: gam_server[25002]: segfault at 0000000000000013 rip 00000000004038a7 rsp 0000007fbfffd3a8 error 4 Nov 17 23:03:08 foo kernel: gam_server[4699]: segfault at 000000000000005c rip 00000000004038a7 rsp 0000007fbfffd3f8 error 4 Nov 18 06:58:10 foo kernel: gam_server[3431]: segfault at 000000000000005c rip 00000000004038a7 rsp 0000007fbfffd3f8 error 4 Nov 18 07:03:00 foo kernel: gam_server[3722]: segfault at 0000000000000008 rip 00000000004038a7 rsp 0000007fbfffd2c8 error 4 Nov 18 07:05:08 foo kernel: gam_server[4694]: segfault at 0000000000000050 rip 0000002a9557b920 rsp 0000007fbfffe480 error 4 Nov 19 22:09:06 foo kernel: gam_server[3447]: segfault at 000000000000005c rip 00000000004038a7 rsp 0000007fbfffd3f8 error 4 Nov 19 22:09:26 foo kernel: gam_server[3863]: segfault at 0000000000000013 rip 00000000004038a7 rsp 0000007fbfffd2c8 error 4 Nov 19 22:38:32 foo kernel: gam_server[3935]: segfault at 00000015000003f8 rip 0000002a9557b920 rsp 0000007fbffff6c0 error 4 Nov 19 23:30:51 foo kernel: gam_server[12659]: segfault at 00000006000003f8 rip 0000002a9557b920 rsp 0000007fbffff700 error 4 Nov 20 09:13:34 foo kernel: gam_server[3445]: segfault at 0000000000000061 rip 00000000004038a7 rsp 0000007fbfffd3f8 error 4 Nov 20 20:47:23 foo kernel: gam_server[3411]: segfault at 0000000000000047 rip 0000002a9557b920 rsp 0000007fbfffe470 error 4 Nov 20 22:20:34 foo kernel: 
gam_server[19964] general protection rip:4046bb rsp:7fbfffe640 error:0 Nov 20 22:20:45 foo kernel: gam_server[7190]: segfault at 00000060000003f8 rip 0000002a9557b6b1 rsp 0000007fbffff7c0 error 4 Nov 20 22:22:45 foo kernel: gam_server[10136]: segfault at 0000000000000066 rip 0000002a9557c3a4 rsp 0000007fbffff6a0 error 4 These have not gone away with 0.0.17.
I'm using gamin-0.0.17-1.FC3 and found this morning that gam_server was using 100% of one CPU on my dual-CPU machine. I do not see the segfaults Philippe posted though.
For crashes or 100% CPU usage on 0.0.17, please follow the instructions at http://www.gnome.org/~veillard/gamin/debug.html to try to provide feedback on what is happening. Daniel
*** Bug 140701 has been marked as a duplicate of this bug. ***
So this is interesting. I have a huge (5K+ files), unorganized directory of photographs on /mnt/ata0/www-images, and I don't have any .gamin config, so it's polling since it's in /mnt/*, and the log file viewed after using the SIGUSR2 signal confirms that. So I open up Konqueror (I'm using KDE) on that directory and see gam_server using 20% of one CPU; it's constantly polling. Then I open one of the photos with Kuickshow, and use the page up/page down keys to move back and forth between images. Now gam_server's using 40% of one CPU. I open another Kuickshow and repeat the previous step and gam_server's utilization goes to 79%. Another Kuickshow, another 20% utilization. Is this just an optimization issue? I'm assuming Konqueror and Kuickshow are both gamin clients. Could gam_server be polling the same directory once for each client?
Oops, I meant to say that the utilization goes up 20% for each gamin client, I did not mean to say that it went from 40% to 79%, it always went up 18-20%.
w.r.t. comment #45 and #46, this is totally unrelated to the current bug, so please open a new bug report if you want feedback on this ! Daniel
OK, after a few days without crash or 100% CPU usage, it happened again. gamin-0.0.17-1.FC3 kernel-2.6.9-1.681_FC3 x86_64 KDE-3.3.1 (compiled from sources) CPU usage goes to the roof, freeze solid (cannot get a console or ssh into the box, ping responds though), then after 5 minutes goes back to normal (at this time, 'top' still shows a load of 26.00). Post-mortem (post-freezem actually) diagnosis: 1. /var/log/mesaages Nov 26 10:43:26 mybox kernel: oom-killer: gfp_mask=0x1d2 Nov 26 10:43:30 mybox kernel: DMA per-cpu: Nov 26 10:43:30 mybox kernel: cpu 0 hot: low 2, high 6, batch 1 Nov 26 10:43:30 mybox kernel: cpu 0 cold: low 0, high 2, batch 1 Nov 26 10:43:30 mybox kernel: Normal per-cpu: Nov 26 10:43:30 mybox kernel: cpu 0 hot: low 32, high 96, batch 16 Nov 26 10:43:30 mybox kernel: cpu 0 cold: low 0, high 32, batch 16 Nov 26 10:43:30 mybox kernel: HighMem per-cpu: empty Nov 26 10:43:30 mybox kernel: Nov 26 10:43:30 mybox kernel: Free pages: 1516kB (0kB HighMem) Nov 26 10:43:30 mybox kernel: Active:181 inactive:236594 dirty:0 writeback:235967 unstable:0 free:379 slab:13567 mapped:2424 pagetables:2311 Nov 26 10:43:30 mybox kernel: DMA free:4kB min:12kB low:24kB high:36kB active:0kB inactive:9788kB present:16384kB Nov 26 10:43:30 mybox kernel: protections[]: 0 0 0 Nov 26 10:44:06 mybox kernel: Normal free:1512kB min:1004kB low:2008kB high:3012kB active:724kB inactive:936588kB present:1031552kB Nov 26 10:44:52 mybox kernel: protections[]: 0 0 0 Nov 26 10:45:13 mybox gpm[2410]: *** info [mice.c(1766)]: Nov 26 10:47:03 mybox kernel: HighMem free:0kB min:128kB low:256kB high:384kB active:0kB inactive:0kB present:0kB Nov 26 10:47:07 mybox gpm[2410]: imps2: Auto-detected intellimouse PS/2 Nov 26 10:47:07 mybox kernel: protections[]: 0 0 0 Nov 26 10:47:08 mybox kernel: DMA: 1*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 4kB Nov 26 10:47:08 mybox kernel: Normal: 108*4kB 3*8kB 4*16kB 1*32kB 1*64kB 7*128kB 0*256kB 0*512kB 0*1024kB 
0*2048kB 0*4096kB = 1512kB Nov 26 10:47:09 mybox kernel: HighMem: empty Nov 26 10:47:09 mybox kernel: Swap cache: add 276677, delete 40720, find 4408/4496, race 0+0 Nov 26 10:47:09 mybox kernel: Out of Memory: Killed process 23735 (gam_server). FWIW, I customized /etc/sysctl.conf with these lines added: # Control shared memory size # (added for PostgreSQL) kernel.shmall = 134217728 kernel.shmmax = 134217728 # Do not overcommit memory vm.overcommit_memory = 2 2. in $HOME/.xsession-errors: gam_poll_scan_directory_internal(/home/user) loop detected gam_poll_scan_directory_internal(/home/user) loop detected gam_poll_scan_directory_internal(/home/user) loop detected ... 465 lines like this Hope this helps Philippe
Daniel, the SIGUSR2 trick is very nice, but as comment #6 points out, it is *very* verbose (it can fill 100MB in minutes), so it will exhaust any reasonable partition pretty quickly, and the fact that it writes to /tmp means bad consequences for the system when that fills up. So it is not currently usable as a way to monitor gamin permanently. I have to turn it on only for short periods of time, and sure enough, these are not the times when things go bad. I suggest two improvements: 1- Make the log directory configurable (defaulting to /tmp) 2- Configure a MAX_SIZE for a log file, after which logs are rotated, possibly compressing old ones automatically. Thanks, Philippe
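As a rough sketch of the second suggestion above, a size-capped debug log could rotate the file before appending once it reaches a limit. This is only an illustration under stated assumptions: capped_log_append() and the single ".1" backup are hypothetical, and gamin's debug code is not claimed to work this way.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/stat.h>

/*
 * Append one line to the log at `path`. If the file has already
 * reached max_bytes, rename it to "<path>.1" first (replacing any
 * older backup) so disk usage stays near 2 * max_bytes.
 * Returns 0 on success, -1 on error.
 */
int capped_log_append(const char *path, const char *line, long max_bytes)
{
    struct stat st;
    char rotated[4096];
    FILE *fp;

    if (stat(path, &st) == 0 && st.st_size >= max_bytes) {
        int n = snprintf(rotated, sizeof rotated, "%s.1", path);
        if (n < 0 || n >= (int) sizeof rotated)
            return -1;
        if (rename(path, rotated) != 0)
            return -1;
    }

    fp = fopen(path, "a");
    if (fp == NULL)
        return -1;
    if (fputs(line, fp) == EOF || fputc('\n', fp) == EOF) {
        fclose(fp);
        return -1;
    }
    return fclose(fp) == 0 ? 0 : -1;
}
```

Making the directory configurable would then only be a matter of how `path` is built, for example from an environment variable, which is left out here.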
Okay, I have tried to track down and change all usage of GList which may potentially result in the loop we are seeing. Basically the analysis is that a list element is freed, put back in the free pool and reused, and then modified through the stale pointer still held at the place where it was freed. That's the only explanation I can find for getting a loop in the lists. As a result I generated a new version with a lot of new cleanups; maybe this time I got it for good. Version 0.0.18 is available as usual from the download page http://www.gnome.org/~veillard/gamin/downloads.html w.r.t. comment #49, the goal really is to find the bug. I don't think gathering days of logs is a good idea :-\ and since it apparently is a race condition (but how, since it is single-threaded?), adding the debugging code is likely to just avoid the problem. Daniel
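The hypothesis above, a freed element being reused while a stale pointer to it is still written through, can be reproduced deterministically with a toy pool. The sketch below is a hypothetical miniature of a pooled list allocator, not glib's implementation; every name in it is made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical list cell served from a fixed pool with a LIFO free list. */
struct cell {
    void *data;
    struct cell *next;
};

#define POOL_CELLS 8
static struct cell pool_storage[POOL_CELLS];
static struct cell *pool_free_head;
static int pool_initialised;

static void pool_init(void)
{
    int i;

    for (i = 0; i < POOL_CELLS - 1; i++)
        pool_storage[i].next = &pool_storage[i + 1];
    pool_storage[POOL_CELLS - 1].next = NULL;
    pool_free_head = &pool_storage[0];
    pool_initialised = 1;
}

/* Pop a cell from the free list; NULL when the pool is exhausted. */
struct cell *cell_alloc(void)
{
    struct cell *c;

    if (!pool_initialised)
        pool_init();
    c = pool_free_head;
    if (c != NULL) {
        pool_free_head = c->next;
        c->data = NULL;
        c->next = NULL;
    }
    return c;
}

/* Push a cell back; the next cell_alloc() hands out this same cell. */
void cell_free(struct cell *c)
{
    c->next = pool_free_head;
    pool_free_head = c;
}

struct cell *cell_prepend(struct cell *list, void *data)
{
    struct cell *c = cell_alloc();

    if (c == NULL)
        return list;
    c->data = data;
    c->next = list;
    return c;
}

/* Length of the list, or -1 if it has more than `limit` elements. */
int cell_count(const struct cell *list, int limit)
{
    int n = 0;

    while (list != NULL) {
        if (n >= limit)
            return -1;
        n++;
        list = list->next;
    }
    return n;
}

/* Returns -1: the stale write below turns list B into a two-cell loop. */
int demo_stale_write(void)
{
    static int x, y, z;
    struct cell *stale, *other;

    stale = cell_prepend(NULL, &x);   /* list A: a single cell           */
    cell_free(stale);                 /* A is freed, `stale` is kept     */
    other = cell_prepend(NULL, &y);   /* list B reuses that very cell    */
    other = cell_prepend(other, &z);  /* B is now z -> y                 */
    stale->next = other;              /* bug: write through stale pointer */
    return cell_count(other, 100);
}
```

Every pointer in the resulting list is valid, which matches the observation earlier in the thread that the l->next values looked sane and merely formed a loop.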
Thanks for the quick response. > Version 0.0.18 is available Downloaded, built x86_64 RPM and installed. Side note: in the changelog of src.rpm, there is no entry for 0.0-17.1 > w.r.t. comment #49, the goal really is to find the bug, I don't > think gathering days of logs is a good idea :-\ and since it is > a race condition apparently (but how it is single-threaded) adding > the debugging code is likely to just avoid the problem Well, currently gathering *any* data is pretty much impossible, given how fast it writes in /tmp. User has to turn debug off in a hurry, so the SIGUSR2 feature becomes sort of useless for users to help you. Besides, the goal of debug is to find any bug, not only this one I think. Cheers, Philippe
OK, it happens again as I speak gamin-0.0.18-1 consumes all CPU FC3 x86_64 kernel-2.6.9-1.681_FC3 x86_64 Using KDE 1. Top top - 17:55:19 up 5 days, 6:29, 6 users, load average: 1.21, 0.62, 0.24 Tasks: 94 total, 2 running, 92 sleeping, 0 stopped, 0 zombie Cpu(s): 98.7% us, 1.3% sy, 0.0% ni, 0.0% id, 0.0% wa, 0.0% hi, 0.0% si Mem: 1024700k total, 980336k used, 44364k free, 159164k buffers Swap: 1534168k total, 808k used, 1533360k free, 437340k cached PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 23049 foobar 25 0 6824 2672 5060 R 97.2 0.3 2:52.77 gam_server 5144 root 15 0 213m 122m 93m S 1.7 12.2 36:01.05 X 19834 foobar 16 0 154m 23m 150m S 0.7 2.3 2:06.04 kdeinit 1 root 16 0 4736 616 4524 S 0.0 0.1 0:02.16 init 2 root 34 19 0 0 0 S 0.0 0.0 0:00.39 ksoftirqd/0 3 root 5 -10 0 0 0 S 0.0 0.0 0:03.10 events/0 4 root 10 -10 0 0 0 S 0.0 0.0 0:00.00 khelper 5 root 15 -10 0 0 0 S 0.0 0.0 0:00.00 kacpid 42 root 5 -10 0 0 0 S 0.0 0.0 0:00.00 kblockd/0 2. kill -SIGUSR2 23049: the debug file is spitting this: Queue Full Queue Full Queue Full Queue Full Queue Full Queue Full Queue Full Queue Full Queue Full Queue Full 769 lines like this, it seems to print them by little groups every few seconds. 3. Syslog: $ sudo grep gam /var/log/messages Nov 29 14:14:29 lw1 kernel: gam_server[3353]: segfault at 00000001000003f8 rip 0000002a95690920 rsp 0000007fbffff6e0 error 4 Dec 1 16:56:22 lw1 kernel: gam_server[3903]: segfault at 00000006000003f8 rip 0000002a956906b1 rsp 0000007fbffff5b0 error 4 Cheers, Philippe
Queue Full is a report from the signal handler. There are more than 500 kernel events stacked for processing. Run gam_server under gdb, possibly started from a vt console, to try to find why there is a segfault or where it is looping. http://www.gnome.org/~veillard/gamin/debug.html I will need a stack trace; this should be possible to get in your case. Daniel
Stack trace. note: when gamin-0.0.18 was built (from the src.rpm), it was linked with the copy of glib-2.0 that I compiled from sources together with my KDE. $ gdb gam_server 23049 GNU gdb Red Hat Linux (6.1post-1.20040607.43rh) Copyright 2004 Free Software Foundation, Inc. GDB is free software, covered by the GNU General Public License, and you are welcome to change it and/or distribute copies of it under certain conditions. Type "show copying" to see the conditions. There is absolutely no warranty for GDB. Type "show warranty" for details. This GDB was configured as "x86_64-redhat-linux-gnu"...gam_server: No such file or directory. Attaching to process 23049 Reading symbols from /usr/libexec/gam_server...Reading symbols from /usr/lib/debug/usr/libexec/gam_server.debug...done. Using host libthread_db library "/lib64/tls/libthread_db.so.1". done. Reading symbols from /opt/kde3.3.1/lib64/libglib-2.0.so.0...done. Loaded symbols for /opt/kde3.3.1/lib64/libglib-2.0.so.0 Reading symbols from /lib64/tls/libc.so.6...done. Loaded symbols for /lib64/tls/libc.so.6 Reading symbols from /lib64/ld-linux-x86-64.so.2...done. Loaded symbols for /lib64/ld-linux-x86-64.so.2 Reading symbols from /lib64/libnss_files.so.2...done. 
Loaded symbols for /lib64/libnss_files.so.2 0x0000002a956905b7 in g_list_last () from /opt/kde3.3.1/lib64/libglib-2.0.so.0 (gdb) where #0 0x0000002a956905b7 in g_list_last () from /opt/kde3.3.1/lib64/libglib-2.0.so.0 #1 0x0000002a9569069d in g_list_append () from /opt/kde3.3.1/lib64/libglib-2.0.so.0 #2 0x0000000000403fc1 in gam_tree_get_children (tree=0x546000, root=0x524a40) at gam_tree.c:265 #3 0x00000000004043ba in remove_directory_subscription (node=0x51b340, sub=0x51d600) at gam_poll.c:559 #4 0x00000000004056b3 in gam_poll_consume_subscriptions () at gam_poll.c:998 #5 0x0000000000408a73 in gam_dnotify_consume_subscriptions_real (data=0x546000) at gam_dnotify.c:212 #6 0x0000002a9569423a in g_main_context_dispatch () from /opt/kde3.3.1/lib64/libglib-2.0.so.0 #7 0x0000002a95696617 in g_main_context_iterate () from /opt/kde3.3.1/lib64/libglib-2.0.so.0 #8 0x0000002a956969aa in g_main_loop_run () from /opt/kde3.3.1/lib64/libglib-2.0.so.0 #9 0x00000000004037c2 in main (argc=0, argv=0x0) at gam_server.c:340 #10 0x0000002a958314ca in __libc_start_main () from /lib64/tls/libc.so.6 #11 0x0000000000402b3a in _start () #12 0x0000007fbffff888 in ?? () #13 0x000000000000001c in ?? () #14 0x0000000000000001 in ?? () #15 0x0000007fbffffad1 in ?? () #16 0x0000000000000000 in ?? ()
>note: when gamin-0.0.18 was built (from the src.rpm), it was linked >with the copy of glib-2.0 that I compiled from sources together with >my KDE. Just to clarify, this is glib-2.4.7.
Hummm ... If you recompile stuff by yourself, this may raise problems that others using the pristine distro will not get. On the other hand you then have the opportunity to rebuild glib with the following configure flags which would help finding the exact location of the corruption: --disable-mem-pools --enable-gc-friendly then run again under gdb or valgrind. The problem comes from a corrupted memory pool. Daniel
I'm getting the same backtrace as comment #54, on i386 FC3 up to date as of 1 Dec, except for having gamin 0.0.18 installed.
Created attachment 107793 [details] yet another gdb backtrace I tried the SIGUSR2 trick but absolutely nothing happens. gamin seems to be sitting stuck yet using CPU...
This also happens in Nahant-1 which has 0.0.9-1. I'm running KDE, have done little more than read email (evolution then later kmail), report bugs (epiphany) and do konsole stuff. The system is an Evectra Pentium III 600EB (so no SMP, no hyperthreads) 128 Mb RAM (so lots of swapping). There doesn't seem to be a lot for me to add other than the datapoint RHEL4 is impacted. I was about to trash the system and was scouting round for valuables when I noticed it was somewhat sluggish. Note that I have a similar report (looping) on Evolution. When I saw Evolution was hogging the CPU this was pretty active too, and it may be that the real problem was Gamin but I blamed E because it was the most active at the times I checked.
w.r.t. comment #57 and #58 gam_tree_get_children basically does list = NULL; for all children list = g_list_append(list, children_data); and the stack trace shows the generated list gets corrupted ! The error is somewhere else, this can only be rationally explained if the list memory pool gets corrupted, and as I said in #56 running a specifically compiled glib version is the best way to reproduce the problem and catch it when it happens not the side effect.
Ouch. I can rebuild glib but I can't constantly run gam_server under gdb/valgrind. I can attach valgrind/gdb to gam-server after I notice it's gone loco though. The set up is that I am administering have multi user machines so I only see the aftermath - I don't know what steps are actually causing this nor can I force users to run things in a debug mode all the time.
Sitsofe, I assume you are running a normal Fedora Core kernel and the glib2 also coming from Fedora Core, right ? Did you reboot the machine after upgrading to 0.0.18 to be sure that no process used an old gam_server. I'm just trying to be 100% sure I'm not chasing something related to inotify or a different release. Daniel
Daniel, yes I am running a normal FC3 kernel (kernel-2.6.9-1.681_FC3) and normal glib2 (glib2-2.4.7-1). No, I didn't reboot after upgrading to gamin 0.0.18, but I know it probably isn't an old gam_server because its start time is Dec02, which is after the RPM's install time of Tue 30 Nov 2004 15:52:50. However, if you are unconvinced I suppose I can reboot all the machines and wait to see if this happens again...
Can people try 0.0.19 that I uploaded at http://www.gnome.org/~veillard/gamin/downloads.html

I did yet another pass at checking all GList usage which could lead to any kind of list pool corruption. I reduced the GList API used by gam_server to a very minimal set, and I added a copy of the GList implementation directly in gam_server, disabling the memory pool and poisoning freed list items. I have been hammering it for a couple of hours, and I'm still unable to reproduce any crash or loop. Please try 0.0.19 and report, as I'm running dry of ideas concerning what is happening, or how to solve it.

thanks, Daniel
Sure thing. I'm going away for a few days so it will be mid next week before I get back to you on this. Do you want machines to be rebooted before reports are submitted back? My one and only thought on this is: are people using any binary drivers? I don't think I've seen this 100% CPU usage happen (yet) on a machine without nvidia binary drivers on it...
I have nvidia drivers but the kernel module was built in my machine.
I have seen this on a machine without nvidia drivers. I do have vmware drivers (occasionally, not always), but I can not remember any coincidence between vmware being loaded and gam eating CPU time.
Sitsofe, killall gam_server as root after the upgrade would do. and it's unrelated to kernel drivers, Daniel
> Can people try 0.0.19

Updated. Now testing.

I noticed that the /usr/libexec/gam_server process is *not* killed upon exit from my (KDE) graphical session. In other words, logging in/out repeatedly ends up using the very same process through different graphical sessions. Shouldn't _all_ user processes started with a graphical session be killed upon exit?

I will compile glib with
--disable-mem-pools --enable-gc-friendly
in the next few days with KDE-3.3.2.

> Hummm ... If you recompile stuff by yourself, this may raise problems
> that others using the pristine distro will not get.

Besides the fact that "pristine" distro users _are_ getting it, this is _good_, as you noticed, because a non-standard build may help find/troubleshoot more bugs.

And since you mention the word "pristine", let me tell you that if Redhat/Fedora's implementation of KDE were indeed pristine and not so crippled (i.e. downgraded menus/apps/configs resulting in a poor man's common denominator with Gnome, inheriting in the process some of its _bad_ user-interface guidelines like the infamous double-click-by-default), you would have more users running your distro. Many KDE users went away from Redhat at the time Bluecurve and the bright "unified-desktop" idea came out. Since I always compile KDE from sources, it does not affect me that much, but I am a somewhat rare case of a KDE enthusiast on Fedora.

Last but not least, did you consider using the C++ STL to replace glib in gamin?

Cheers,
Philippe
> I noticed that the /usr/libexec/gam_server process is *not* killed
> upon exit from my (KDE) graphical session.

The server exits after 30 seconds without a client connection.

> I will compile glib with
> --disable-mem-pools --enable-gc-friendly
> in the next few days with KDE-3.3.2.

Not needed, 0.0.19 has its own copy of the GLib list code.

> pristine and KDE

Not my business, I don't use it; my point is reproducibility of reports. I learnt, for example, that depending on the automake version, something as simple as gamin 0.0.18 gets compiled completely differently. Pristine means that the bug report is valid for others using the distro.

> did you consider using the C++ STL to replace glib in gamin

The client side does *NOT* use glib. The client side of FAM was using the C++ STL, forcing all clients to load the library :-( That's one of the reasons we rewrote the package altogether. The server side is a standalone program, based on glib because
- we know glib well
- I don't want to code in C++

Daniel
I have not seen a problem since installing 0.0.17 (see comment #39). Since I have nothing further to report, I am removing myself from the CC: list.
I've just been running 0.0.19 for about a half-hour, trying to reproduce this as well as bug 140920. So far it seems to be dramatically better than 0.0.18. I see at most 8% CPU utilization, even with many clients.
Hmm, I was able to trigger the 100% (well, 98-88%) CPU usage case again this morning. Not sure what I did, and I can't get it to happen again, but I did have to do a killall gam_server to recover. This is with 0.0.19.
>> I noticed that the /usr/libexec/gam_server process is *not* killed
>> upon exit from my (KDE) graphical session.
> the server exits after 30 seconds without client connection.

Definitely not in my case. When I logout from KDE, /usr/libexec/gam_server does not exit (I watched it for 10 minutes before killing it). I have verified from a root shell that the user has no more processes on the machine (except /usr/libexec/gam_server).

PS: Haven't reproduced the high-CPU usage yet with 0.0.19
Oops, after double checking, I *had* one stale process somewhere which apparently established contact with gam_server after I logged out. Killing it allowed gam_server to exit by itself now.
How about kdm? Are you running it?
> How about kdm? Are you running it? If the question is to me, the answer is yes (as root of course).
In versions up to and including 0.0.19, I've been able to trigger the 100% CPU usage problem by following these steps: I open a single Konqueror window on a directory that contains a lot of photographs. Then I open a photo with Kuickshow (right click image icon -> open with -> Kuickshow) and use the mouse scrollwheel over the Kuickshow image to move to the next or previous image in the directory. I usually open about 10 individual Kuickshow applications and use the mouse wheel to move between images in each one (not sure this moving between images is important, but I think it may be). Finally I close all the images by selecting "Close All" from the KDE panel (having enabled the "Group Similar tasks" panel option).

I, and other people at the organization I contribute my time to, have performed these same steps with RH9 and FC1 without seeing this problem.

I'm using gdm.
Please provide the information about the process state, the gdb stack trace, and a fragment of the log generated using SIGUSR2, as pointed out previously, if you reproduce this with 0.0.19: http://www.gnome.org/~veillard/gamin/debug.html#Debugging1

If you have a reproducible way to trigger this, then switch gam_server to debugging mode with SIGUSR2 before the problem occurs, make it hang following your recipe, kill it, and provide the debug output found in /tmp as an attachment to this bug.

thanks, Daniel
About comment#80 I cannot trigger the 100% usage by your recipe. Using KDE-3.3.2 compiled from sources (gcc-3.4.2) on FC3. I tried on two marchitectures (i386: P4/768MB-RAM and x86_64: Opteron/1GB-RAM), with a directory containing 1010 images (all of them ~100kB 600x400 JPEGs): open >10 kuickshow instances by right-clicking, scroll a bit in each, close all -> exit fine. Is this "a lot of photographs" by your standards ?
This has happened only twice after moving to 0.0.19; previously it would happen consistently. In my case, "a lot" == about 10000 files. (Don't ask me why they like to 'organize' their photos this way.) I don't think the architecture should matter, but the machine in question is a dual PIII-800 MHz with 1 GB RAM running KDE on a vanilla, up-to-date FC3 install.
Created attachment 108323 [details] debug output from /tmp from before the 100% utilization occurred
w.r.t. comment #84: I looked at the logs, and you are doing 2 bad things:

1/ you ask FAM to watch a directory with 10,000+ files in it
2/ that directory is under /mnt

1/ means that when gam_server needs to check for modifications, it needs to stat all files in the directory to check for changes, which amounts to 10,000 stat() calls and checks.

2/ gamin does not use the kernel notification API for directories which may be temporary mount files like /mnt/..., so it uses a 1 second timeout and rechecks for changes every time.

The conjunction of 1/ and 2/ means gam_server spends its time checking your files. It's not really a software loop, not a bug, but how it is designed at the moment. It is not the same problem as the one this bugzilla entry was opened for. You can probably avoid the problem by removing either 1/ or 2/, but I can't find a fix to your problem, given that kernel dnotify must not be used on /mnt/... files and that maintaining the FAM semantics on a 10,000 entry directory requires stat'ing all entries in that directory if the kernel can't tell they were not modified. You're pushing the FAM API to the limit of what your computer can handle, so this doesn't work well...

Daniel
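To make the cost described above concrete, here is a minimal Python sketch of stat-based polling. It is not gamin's code (gam_server is written in C), and the names `snapshot` and `diff` are made up for illustration. The point is only that each pass has to stat every entry once, so a 10,000-entry directory rechecked every second costs on the order of 10,000 stat() calls per second.

```python
import os

def snapshot(directory):
    """Stat every entry once; returns {name: (mtime_ns, size)}."""
    state = {}
    with os.scandir(directory) as entries:
        for entry in entries:
            try:
                st = entry.stat(follow_symlinks=False)
            except FileNotFoundError:
                continue  # entry vanished between listing and stat
            state[entry.name] = (st.st_mtime_ns, st.st_size)
    return state

def diff(old, new):
    """Compare two snapshots; returns (created, deleted, changed) name lists."""
    created = sorted(n for n in new if n not in old)
    deleted = sorted(n for n in old if n not in new)
    changed = sorted(n for n in new if n in old and new[n] != old[n])
    return created, deleted, changed
```

A poller would call `snapshot()` once per interval and feed consecutive results to `diff()`; the work per pass grows linearly with the number of entries, whether or not anything changed.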
Yes, I believe I mentioned the conditions before. However, this problem didn't exist in FC1, it's only after upgrading to FC3 that I'm seeing it (all drives mounted under /mnt are exactly the same as what I was using in FC1, I didn't modify them when I installed FC3.) Also, the 100% utilization persists after I close the client, which I think must be a bug. For some reason I'm now able to reproduce this consistently using just one Konqueror session (no Kuickviews). I haven't changed any configuration or updated since upgrading to 0.0.19. I did reboot to see if that had any effect, but it didn't. For my part, I can easily disable gamin on /mnt/*, however this problem seems like a regression from FC1
I've got gam_server taking 100% of the cpu as well. It looks like it is triggered by a combination of a process that I wrote that collects data from a machine and puts it in the directory and accessing the same directory from Konqueror. This is just a guess on my part as I haven't thoroughly tested it. The directory now has 17,500 files, each about 250KB in size. I don't have time to run tests on this, I just thought I'd share that I've got a similar problem.
I've just seen this problem

maui ~ 1001# uname -a
Linux maui.ee.port.ac.uk 2.6.9-1.667 #1 Tue Nov 2 14:41:25 EST 2004 i686 i686 i386 GNU/Linux
maui ~ 1002# uptime
 20:43:49 up 17 days, 10:26, 3 users, load average: 1.60, 1.51, 1.09

Not seen until today !!!!
Full FC3 install
No help tying it down I'm afraid
I just experienced this on Fedora Core 3 Test 3, following a full "yum update" (gamin-0.0.17-1.FC3). Earlier in this bug, dnotify was mentioned. About 2-3 weeks ago I was experiencing some problems with Courier's IMAP server when running on very large Maildirs, and my searches led me to some posts that implied there were some basic deficiencies in dnotify on Linux. Perhaps there is some issue with these system calls?
Doesn't look like there is any resolution of this issue from reading the comments above. I have the same problem when dealing with large numbers of files (up to 100,000). It is pretty reproducible. Note that I am not trying to view the files themselves, but just look at directories which contain many files. I understand that, according to comment 85, using FAM/gamin on a directory with this many files is not advisable. But I have seen no comments on how to turn it off. Is that possible? If so, what are the repercussions?

Steps to reproduce:
1. Unzip archive with 10,000 - 100,000 files in it, into a folder.
2. View the folder (the machine thinks for a while, then reports the number of files in the folder. Shortly after this gam_server starts to take 100% of one CPU on a dual Xeon machine).
I am using gamin-0.0.15-1.x86_64 on a RHEL4-B2 machine, and I just had gamin max out. This is with the x86_64 install on an Athlon64 3000+ processor. I don't currently have time to troubleshoot, but as I didn't see Athlon64 mentioned before, I thought I would add the comment.
I suggest people try 0.0.20 as it has a potential fix for most corruptions raised so far. http://www.gnome.org/~veillard/gamin/downloads.html Daniel
I don't see any difference between .19 and .20 related to this bug. I did 'killall gam_server' before testing .20 and verified that I had a new PID before testing. When I closed the gam client, gam_server continued to run using 100% of one CPU until I killed it after about two minutes.
Ok, I (original reporter of this bug) haven't seen this bug in quite a while now. But now that I think about it... around the time FC3 went final, I rearranged my drive setup.

1. Everything (including / and /home) had been on a slow RAID 5 array of 5400 RPM IDE drives. When I installed FC3, I did a fresh install on a 10k SCSI drive.
2. I didn't move any of my junk over, just mounted the old array on /slow.
   $ ls -R /slow/home/wes | wc -l
   31647
   $ ls -R /home/wes | wc -l
   2516
3. The old install had been continuously hand-upgraded (using rpm, not anaconda) since rhl8 or so.

Now, looking at comment #85 from Daniel... I wonder if maybe the original bug I reported is indeed fixed. Now that I think about it, in the later occurrences that I saw (and didn't bother adding here, because I saw nothing new in them at the time), while it was still looping, it was looping over a large number of files, not the small number I saw at first. My slow drive array, coupled with gamin for some reason using the timer-based rescan instead of dnotify, might explain it. (But in my comment #6 above I note dnotify in the backtrace; I don't remember whether that was a small or large # of files loop.)

So questions for Daniel:

A) Exactly what logic does gamin use to decide if it can use dnotify or not? Has that changed at all? (I wonder if something about my setup due to item 3 above caused it to misidentify paths in my home directory and not use dnotify.)
B) When gamin starts watching a directory, does it always do so recursively? (/slow/home/wes only has a few hundred entries itself; it's all the subdirs of stuff that make it big.)

Happy New Year, everyone. Let's see if we can't get this bug closed before we hit 100 comments ;-)
I experienced the problem with gamin-0.0.24-1.FC3 in Fedora Core 3 on an HP Compaq workstation with an Intel Celeron processor. For no obvious reason gam_server starts to take 100% of the CPU time until it is killed along with nautilus.
I've just been subject to this bug. I'm using FC3 + all updates. The rpm installed is gamin-0.0.25-1.FC3 and I'm on a 64-bit AMD platform. I fixed the problem by restarting my X server. Killing the process on its own just resulted in it respawning and, after a short period, using nearly 100% CPU again. Unfortunately all I thought to do was strace the offending gam_server process. It was spinning trying to read the file /usr/local/share/applications/mimeinfo.cache and a couple of other files from that directory, which don't exist on my system. Their correct path does not have 'local' in it.
gamin sucks, how the **** do I turn it off??
I have this issue on my system too. gamin 0.0.25 on Fedora Core 3/AMD64. My workaround is sending a SIGSTOP (killall -19 gam_server), which freezes the gam_server process. As soon as the copy process has finished, I send a SIGCONT (killall -18 gam_server). gam_server then works just as expected, consuming very little CPU time. I hope this bug will be found soon. It's rather annoying. :)
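The same stop/continue workaround can be sketched in Python. This is only an illustration: the helper names `pause` and `resume` are hypothetical, and the PID has to be looked up separately (for example with pgrep). Using the symbolic signal names avoids depending on the numeric values 19 and 18, which are what Linux uses on x86 but differ on some other platforms.

```python
import os
import signal

def pause(pid):
    """Freeze a process; SIGSTOP cannot be caught or ignored."""
    os.kill(pid, signal.SIGSTOP)

def resume(pid):
    """Let a previously stopped process continue."""
    os.kill(pid, signal.SIGCONT)
```

Note that this only suspends the process for a while; it does nothing about whatever makes gam_server busy in the first place.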
I'm running 32-bit Fedora Core 3, with all the latest updates according to 'yum', and was getting 99% CPU usage by gam_server after I added some pictures to one directory and added some symlinks to some scripts in .gnome2/nautilus-scripts/, which I then used to play with those images. None of the tricks above worked (including restarting it several times), but then I upgraded to 0.0.26-1, restarted, and everything went back to normal. I got the upgrade from:
http://download.fedora.redhat.com/pub/fedora/linux/core/development/i386/Fedora/RPMS/
$ rpm -qf /usr/libexec/gam_server
gamin-0.0.25-1.FC3.i386

I'm seeing the issue described in comment 98. I'll attach the debug output generated by sending the daemon SIGUSR2.
Created attachment 114906 [details]
gamin debug output

The strace output is looping as below:

stat64("/usr/local/share/applications/defaults.list", 0xbfffe70c) = -1 ENOENT (No such file or directory)
open("/usr/local/share/applications/defaults.list", O_RDONLY|O_NONBLOCK|O_LARGEFILE|O_DIRECTORY) = -1 ENOENT (No such file or directory)
stat64("/home/jorton/.local/share/applications/mimeinfo.cache", 0xbfffe70c) = -1 ENOENT (No such file or directory)
open("/home/jorton/.local/share/applications/mimeinfo.cache", O_RDONLY|O_NONBLOCK|O_LARGEFILE|O_DIRECTORY) = -1 ENOENT (No such file or directory)
stat64("/home/jorton/.local/share/applications/defaults.list", 0xbfffe70c) = -1 ENOENT (No such file or directory)
open("/home/jorton/.local/share/applications/defaults.list", O_RDONLY|O_NONBLOCK|O_LARGEFILE|O_DIRECTORY) = -1 ENOENT (No such file or directory)
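For anyone capturing a similar trace, a small Python sketch along these lines can confirm that the same few calls repeat, much like the sort/uniq check in the original report. The function name `summarize` and the regular expression are illustrative assumptions; the pattern expects each line to start with the syscall name, so traces with a leading PID or timestamp column would need it adjusted.

```python
import re
from collections import Counter

# Matches the syscall name and its first quoted argument,
# e.g. stat64("/path", ...). Lines without that shape are skipped.
CALL = re.compile(r'^(\w+)\("([^"]*)"')

def summarize(lines):
    """Count how often each (syscall, path) pair occurs in strace output."""
    counts = Counter()
    for line in lines:
        m = CALL.match(line.strip())
        if m:
            counts[(m.group(1), m.group(2))] += 1
    return counts
```

Calling `summarize(open(path)).most_common(10)` on a saved trace (with `path` being wherever the strace output was written) lists the most frequently repeated calls first.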
try to update to 0.0.26 which fixes crashes and CPU consumption, or even better 0.1.0 which also fixes a bunch of other problems. Daniel
Do you have a copy of this built for FC3 somewhere?
> Do you have a copy of this built for FC3 somewhere?

There should be an FC3 update if gamin-0.1.0 fixes problems. However, there is not one for the moment. You can just do an rpmbuild from the src.rpm.

For Daniel: I noticed that the following files disappeared from gamin-devel in version 0.1.0:

/usr/lib64/libfam.la
/usr/lib64/libgamin-1.la

and this breaks things with libtool. Therefore, upgrading FC3 to 0.1.0 is a no-no for me.
At the risk of being just another "me too", I'm getting this for the first time after upgrading to FC4 (from 3). gamin-0.1.3-1.FC4.i386.rpm

Not sure what the USR2 trick everyone's on about is; it didn't do anything for me. In a week, I've caught it twice: once I strace'd it cycling through an infinite loop of a directory's contents, and the second time it gave zero output in strace.

Should this bug still be tagged "devel"?
I am often seeing excessive CPU usage with the Ubuntu packages of 0.1.5 (0.1.5-0ubuntu1) as well. gam_server regularly goes fubar and consumes 99% CPU time. It's highly irritating; it's come to the point that I have a cron job that kills it every 15 minutes automatically. Anyway, I tried the SIGUSR2 trick and got the output. Is there any other useful info I should attach here? I can't attach gdb to the running process since everything is compiled without the debug option.

Here's the output when debugging is turned on for gam_server:
http://albin.abo.fi/~ninylund/dump/gamin-debug/gamin_debug_Op1icj
http://albin.abo.fi/~ninylund/dump/gamin-debug/gamin_debug_Oszxi0
http://albin.abo.fi/~ninylund/dump/gamin-debug/gamin_debug_dFcUS5
Interesting... Are non Fedora/Red Hat gamin bugs OK here? Nikalus - did you really reproduce this problem with Red Hat's rawhide or did you just choose that because there was nothing else?
Oh bummer, didn't think about that this is redhat's bugzilla. There was just a link on gamin's homepage that took me here. I guess I didn't use rawhide, since I've never heard of such a thing.
I see this as well, on my one processor Thinkpad laptop. I haven't noticed a pattern that triggers the excessive CPU consumption, I just occasionally notice the CPU usage meter in my system tray get pegged at 100% (either all user time or all system time, the latter is probably gam_server calling poll in a tight loop).
This bug is "interesting". Originally there was a bug that caused gamin to really go into a tight 100% CPU loop, looping over a circular list. However, we now believe that the circular list bug is fixed, and other reports of this are mainly about gamin using lots of CPU when polling a large directory or when getting lots of change events from dnotify (e.g. when downloading something fast). It would be nice if people seeing this could try to determine what sort of problem they are seeing, i.e. when this happens, attach to gamin (with debuginfo installed) and see if it's just spinning over one particular list forever.
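For readers unfamiliar with the failure mode, here is a minimal Python sketch of why a circular list makes a traversal spin forever and how a cycle can be detected. `Node` is a stand-in for a list cell, not glib's GList, and the check shown is Floyd's tortoise-and-hare algorithm, used here purely as an illustration; nothing in this thread says gamin uses it.

```python
class Node:
    """Stand-in for a list cell; not glib's GList."""
    def __init__(self, data, next=None):
        self.data = data
        self.next = next

def has_cycle(head):
    """Floyd's tortoise-and-hare: True if following .next never reaches None."""
    slow = fast = head
    while fast is not None and fast.next is not None:
        slow = slow.next           # advances one cell per step
        fast = fast.next.next      # advances two cells per step
        if slow is fast:           # the pointers can only meet inside a cycle
            return True
    return False
```

A plain "walk until next is None" loop over a list for which `has_cycle()` returns True never terminates, which matches the tight 100% CPU loop described above.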
What version do you think it was fixed in? Last one I investigated was comment#107
This happened to me in FC4 with gamin-0.1.0-1.1. CPU usage would peg for a while, then be normal for a while, on the order of 30 to 90 seconds. I was running an application that reads and writes a couple of large files at a time for several minutes. I was watching the directory in File Browser, clicked Reload in the browser several times, and deleted a few files at a time, a few minutes before I noticed the CPU usage. The directory has about 220 files, and is in the /data partition which is mounted under /. It is a dual processor system. When I did kill -SIGUSR2 the CPU usage immediately went back to normal and remained normal. I'm attaching the beginning of the debug file.
Created attachment 123024 [details] beginning of gam_server log file from FC4 gamin-0.1.0-1.1
I am seeing this problem with gamin-0.1.7-1.1 (FC5T2 + current rawhide). I didn't have the debug package installed at the time. I will try to reproduce and send gdb info.
I am seeing this bug a lot when I use konqueror under Gnome to browse for files (i.e. as a file manager) on my big NFS server. On a 2.4 P4 with 2GB of RAM it'll go up to 70-90% CPU and stay there for dozens of seconds. It'll do this even when I'm not doing anything with the folders that Konqueror is browsing. It definitely is Konqueror triggering it, because I don't use KDE apps normally, and didn't used to, and never had this problem before. I'm not sure if the fact I'm browsing a 2TB NFS server is exacerbating the problem.

FC3
gamin-0.1.1-3.FC3

This isn't just a FC bug:
http://www.gnusolaris.org/cgi-bin/trac.cgi/ticket/60
http://www.irclogs.ws/freenode/kde/30Oct2005/13.html

I just renamed and killed the gam_server binary and now all my gnome apps are going nuts using 100% CPU:

27852 trevor 25 0 17664 3616 3120 R 16.7 0.2  0:44.11 gnome-settings-
30113 trevor 22 0  232m  65m  32m R 16.4 3.2 30:49.73 soffice.bin
10130 trevor 25 0 26320 8692 5360 R 14.4 0.4  1:34.74 gnome-panel
27932 trevor 25 0 19260 2484 2152 R 13.8 0.1  0:14.60 gnome-vfs-daemo
10674 trevor 25 0  142m  59m  15m R 13.1 2.9 13:41.91 galeon
27918 trevor 25 0 44808 5988 4832 R 12.8 0.3  1:00.80 nautilus

Seems to be stuck this way. As I kill those apps one by one the others take up the slack to use up 100% CPU. Guess I have to restore the file.

The other thing is, I swear that over a week or two of an uninterrupted X session gam_server goes nuts more frequently. It may just be my perception though.
Why is this 'file modification detection server' polling directories? This seems to be suicide if there are large numbers of child nodes... why doesn't it just listen to events fired by the kernel?
Some filesystems (i.e. NFS) don't fire events.
I saw this problem using 2.6.15-1.1833_FC4, gamin-0.1.1-3.FC4. I was using firefox-1.0.7-1.2.fc4 to download a 132 MB file from a server on our local 100 Mb LAN. I was in gnome at the time. Here is a snippet from top:

  PID USER   PR NI  VIRT  RES  SHR S %CPU %MEM   TIME+ COMMAND
11222 dacker 16  0  123m  52m  20m R 30.2  5.2 2:27.62 firefox-bin
11067 dacker 16  0  2464 1208  872 R 19.6  0.1 0:01.93 gam_server
11083 dacker 15  0 35116  17m  11m S  6.0  1.7 0:03.45 nautilus

This happens if I save the file to the desktop. If I save it to my home, things get better:

  PID USER   PR NI  VIRT  RES  SHR S %CPU %MEM   TIME+ COMMAND
11222 dacker 15  0  123m  52m  20m S 30.3  5.2 3:19.01 firefox-bin
11067 dacker 15  0  2464 1208  872 S 10.6  0.1 0:26.35 gam_server
I am seeing this on fc5 - and I don't use NFS at all. It is just the default(?) LVM2 root with ext3 and boot with ext3, and a few sshfs/fuse mounted.
See also bug 196444 with regards to constant 4% CPU usage for no reason and *huge* mem leak. Probably related. Bugs confirmed in FC3, 4 and 5. gam_server blows.
(In reply to comment #120)
> Some filesystems (i.e. NFS) don't fire events.

Perhaps, but even NFS must call through the kernel, I believe. Unless there is really some exception to this (even on newer kernels), it seems like gam_server needs to switch into some kind of callback mode and notify its clients via those messages... instead of polling directories. Retain the polling mode for really old kernels, that's fine.
I ran a few tests. Killing all konqueror and any other KDE apps I could find (there were none) didn't help. gam_server still eats a constant 3.5-4.5% CPU on my 2.4 P4. I tried killing/restarting gam_server after that and it immediately starts up again and still eats 3.5-4.5%. I can't easily umount my NFS as it's used for important server processes.
Another possible common thread: another reporter says they have a 2TB fs (local). I have a 2TB fs (non-local) mounted over NFS. Do other reporters have large fs's? Also, I run SMP kernel.
I want to add a "me too": RHEL4 on quad-cpu x86_64 server, I have a 2TB+ volume (over lvm2) as well, gamin-0.1.1-3.EL4 gamin-0.1.1-3.EL4 I run NFS server, but there are also some directories NFS-mounted to this server.
I see I put myself in the cc-list in May, but honestly I haven't seen the problem recently (on FC5, still only local LVM volumes < 80GB + sshfs/fuse mounts). I assume that the bug has either been fixed since, or that I have somehow managed to work around it - I have switched off nautilus in the session manager, for example, since I don't use nautilus at all anyway.
Another "me too" here. Two dual-core CPUs, x86_64 workstation running the 2.6.9-34 kernel (Red Hat ES 4 Workstation) with 8 GB RAM. I have a couple of 2+ TB LVM arrays. gam_server is using 85-95% of one CPU. Not sure when this issue started or what set it off.
Another "me too". Dual Xeon fileserver running 2.6.9-34.0.1.ELsmp with 4Gb RAM. Attached to FC SAN (multi-TB). Many other servers NFS-mount and CIFS-mount to this box. gam_server using 80-95% of one CPU constantly.
me too.

gamin-0.1.1-4.EL4
thunderbird-1.5.0.10-0.1.el4.centos
kernel-2.6.9-42.0.8.EL

i used to see this a lot on a prior system running courier-imap. running dovecot on this one. wasn't seeing the gam_server hang much recently until a recent yum update when, among others, thunderbird upgraded (to above) from thunderbird-1.5.0.9. now i frequently find gam_server eating all available CPU. kill it, and it immediately gets respawned. kill thunderbird, and gam_server quiets down. relaunch thunderbird and things are fine again for a couple days. if you want more info let me know what would be helpful.
This is not a "Medium" problem - it is a very severe problem. In years of running Linux, I have never seen a package perform like this - it's practically a virus. gam_server constantly causes problems, and I must renice it. It seems to have a terrible interaction with KDE. This has been going on for too long; a package needs to be created to remove this software, or it needs to be fixed! This is not a medium bug - just do a google search on gam_server!!!!!
i have reduced the effects of this bug by (1) launching a cron job every 15 minutes to renice all gam_server processes to bottom priority, and (2) backing out thunderbird from 1.5.0.10 to 1.5.0.9, which for whatever reason seems to encounter the bug far less frequently. but of course, the bug remains.
Gamin is up to version 0.1.9 now in F8 and rawhide; F7 has 0.1.8, and even RHEL4 has been updated as far as 0.1.7. Older distro releases are EOL/in maintenance support at best (i.e. go grab an SRPM and update it yourself) There have been no new reports added to this bug in half a year. I personally haven't seen it in *years*. Is anyone experiencing this with a semi-current version of gamin? Or can we finally close this one?
We have just begun upgrading our compute farm to RHEL4 and we are seeing this problem. We have a fairly complex NFS setup with several netapps volumes. I'm not the sysadmin, so I don't have root access. To give a little insight, we have set up a single machine with freenx and are using it for our local site session server. I am only paying attention to this machine right now, but I know that other RHEL4 machines have had issues prior to this when users were starting VNC sessions before the freenx transition. We did not have this problem with RHEL3.

lngl0116:/home/kbingham-> rpm -qf /usr/libexec/gam_server
gamin-0.1.7-1.2.EL4
gamin-0.1.7-1.2.EL4

Here are the contents of the gaminrc file:

lngl0116:/home/kbingham-> more /etc/gamin/gaminrc
# configuration for gamin
# Can be used to override the default behaviour.
# notify filepath(s) : indicate to use kernel notification
# poll filepath(s) : indicate to use polling instead
# fsset fsname method poll_limit : indicate what method of notification for the filesystem
#                                  kernel - use the kernel for notification
#                                  poll - use polling for notification
#                                  none - don't use any notification
#
#                                  the poll_limit is the number of seconds
#                                  that must pass before a resource is polled again.
#                                  It is optional, and if it is not present the previous
#                                  value will be used or the default.
fsset nfs poll 10 # use polling on nfs mounts and poll once every 10 seconds

Not all users are seeing this run out of control:

lngl0116:/home/kbingham-> ps -eaf | grep gam_server
ssirun    1096     1  0  2007 ?     00:05:31   /usr/libexec/gam_server
hsales    2055     1  0 Jan02 ?     00:00:56   /usr/libexec/gam_server
szanatta  2882     1  0 Jan03 ?     00:00:11   /usr/libexec/gam_server
dreed     3710     1  0 Jan03 ?     00:00:22   /usr/libexec/gam_server
jkoller   3801     1  0  2007 ?     00:00:12   /usr/libexec/gam_server
jlawson   6022     1  0 Jan04 ?     00:01:40   /usr/libexec/gam_server
wstrickl  8332     1 36 Jan05 ?     11:33:38   /usr/libexec/gam_server
nphillip  9248     1  0  2007 ?     00:00:41   /usr/libexec/gam_server
nmysore  13140     1 55 Jan04 ?     1-04:11:33 /usr/libexec/gam_server
rkhan    13352     1  0  2007 ?     00:01:54   /usr/libexec/gam_server
bonfanti 23065     1  0 Jan03 ?     00:00:21   /usr/libexec/gam_server
bcruiksh 24066     1  0 Jan03 ?     00:00:08   /usr/libexec/gam_server
mbarnes  24673     1  0 Jan02 ?     00:00:20   /usr/libexec/gam_server
bgreiner 24736     1 54 Jan04 ?     1-01:59:05 /usr/libexec/gam_server
kbingham 25283     1  0 Jan05 ?     00:00:01   /usr/libexec/gam_server
kbingham 25419 21283  0 15:28 pts/4 00:00:00   grep gam_server
mfalkinb 26903     1  0 Jan05 ?     00:00:01   /usr/libexec/gam_server
lphillip 27510     1  0 Jan04 ?     00:00:38   /usr/libexec/gam_server
jkeefer  29207     1 48 Jan05 ?     18:10:39   /usr/libexec/gam_server

Any suggestions?
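As a reading aid for the gaminrc format quoted above, here is a minimal Python sketch of a parser for the three directives its comments describe (notify, poll, fsset). It is not gamin's own parser; the function name `parse_gaminrc`, the treatment of `#` as starting a comment anywhere on a line, and the choice to raise on unknown lines are all assumptions made for illustration.

```python
def parse_gaminrc(text):
    """Parse notify/poll/fsset directives; returns a list of tuples."""
    rules = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # assumption: '#' starts a comment
        if not line:
            continue
        fields = line.split()
        keyword, args = fields[0], fields[1:]
        if keyword in ("notify", "poll") and args:
            # notify/poll take one or more file paths
            rules.append((keyword, args))
        elif keyword == "fsset" and len(args) in (2, 3):
            fsname, method = args[0], args[1]
            # poll_limit (seconds) is optional per the file's own comments
            limit = int(args[2]) if len(args) == 3 else None
            rules.append(("fsset", fsname, method, limit))
        else:
            raise ValueError(f"unrecognised line: {raw!r}")
    return rules
```

Fed the file shown above, this would yield a single rule, `("fsset", "nfs", "poll", 10)`, i.e. NFS mounts are polled at most once every 10 seconds.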
UPDATE to my comment #135: We did not see this previously: http://kbase.redhat.com/faq/FAQ_85_11914.shtm Our sysadmin is doing the upgrade to gamin-0.1.7-1.4.EL and we will see if we have any additional issues.
2.6.9-42.0.10.ELsmp #1 SMP Tue Feb 27 09:40:21 EST 2007 x86_64 x86_64 x86_64 GNU/Linux

Sorry, Wes - I'm running 0.1.9 on a production server and experience the runaway problem. gam_server will behave for a few hours - sometimes a few days. I wrote a custom daemon that uses the fam-2.7.0 library. gam_server is, of course, required by fam. As a temporary solution a cron job now stops my daemon. Doing this is not enough, however. I still have to kill gam_server, and then restart my daemon. (I'm a little worried about the effect of killing gam_server in the midst of some operation.) Is there a better alternative?

My daemon monitors a few directories and triggers actions when files appear. gam runs in polling mode because I don't want to re-build the kernel on the production box. Maybe this isn't a problem if it runs from the kernel?

I've tried the config file trick:

/etc/gamin/gaminrc:
fsset ext3 poll 5

Still, no joy.
Changing version to '9' as part of upcoming Fedora 9 GA. More information and reason for this action is here: http://fedoraproject.org/wiki/BugZappers/HouseKeeping
This message is a reminder that Fedora 9 is nearing its end of life. Approximately 30 (thirty) days from now Fedora will stop maintaining and issuing updates for Fedora 9. It is Fedora's policy to close all bug reports from releases that are no longer maintained. At that time this bug will be closed as WONTFIX if it remains open with a Fedora 'version' of '9'. Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, simply change the 'version' to a later Fedora version prior to Fedora 9's end of life. Bug Reporter: Thank you for reporting this issue and we are sorry that we may not be able to fix it before Fedora 9 is end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora please change the 'version' of this bug to the applicable version. If you are unable to change the version, please add a comment here and someone will do it for you. Although we aim to fix as many bugs as possible during every release's lifetime, sometimes those efforts are overtaken by events. Often a more recent Fedora release includes newer upstream software that fixes bugs or makes them obsolete. The process we are following is described here: http://fedoraproject.org/wiki/BugZappers/HouseKeeping
Fedora 9 changed to end-of-life (EOL) status on 2009-07-10. Fedora 9 is no longer maintained, which means that it will not receive any further security or bug fix updates. As a result we are closing this bug. If you can reproduce this bug against a currently maintained version of Fedora please feel free to reopen this bug against that version. Thank you for reporting this bug and we are sorry it could not be fixed.