Description of problem: gam_server uses 100% of CPU: PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 5889 gordonf 25 0 3248 1208 960 R 100 0.1 37:02.33 gam_server Version-Release number of selected component (if applicable): > rpm -qa | grep gamin gamin-0.1.7-1.2.EL4 How reproducible: Boot a RHEL4 Update 5 machine, and after a while, gam_server starts using 100% of the CPU. Additional info: > ps fax PID TTY STAT TIME COMMAND 1 ? S 0:00 init [5] 2 ? S 0:00 [migration/0] 3 ? SN 0:00 [ksoftirqd/0] 4 ? S 0:00 [migration/1] 5 ? SN 0:00 [ksoftirqd/1] 6 ? S< 0:00 [events/0] 8 ? S< 0:00 \_ [khelper] 9 ? S< 0:00 \_ [kacpid] 32 ? S< 0:00 \_ [kblockd/0] 33 ? S< 0:00 \_ [kblockd/1] 51 ? S 0:00 \_ [pdflush] 52 ? S 0:00 \_ [pdflush] 54 ? S< 0:00 \_ [aio/0] 55 ? S< 0:00 \_ [aio/1] 435 ? S< 0:00 \_ [ata/0] 3228 ? S< 0:00 \_ [kauditd] 7 ? S< 0:00 [events/1] 436 ? S< 0:00 \_ [ata/1] 437 ? S< 0:00 \_ [ata_aux] 34 ? S 0:00 [khubd] 53 ? S 0:00 [kswapd0] 199 ? S 0:00 [kseriod] 441 ? S 0:00 [scsi_eh_0] 442 ? S 0:00 [scsi_eh_1] 478 ? S 0:00 [kjournald] 1754 ? S<s 0:00 udevd 3292 ? S 0:00 [kjournald] 3712 ? Ss 0:00 syslogd -m 0 3716 ? Ss 0:00 klogd -x 3730 ? Ss 0:00 irqbalance 3742 ? Ss 0:00 portmap 3762 ? Ss 0:00 rpc.statd 3797 ? Ss 0:00 rpc.idmapd 3908 ? Sl 0:00 ypbind 4074 ? S 0:00 /usr/sbin/automount --timeout=300 --use-old-ldap-look 4181 ? S 0:00 /usr/sbin/automount --timeout=300 --use-old-ldap-look 4209 ? S 0:00 /usr/sbin/automount --timeout=300 --use-old-ldap-look 4276 ? S 0:00 /usr/sbin/automount --timeout=300 --use-old-ldap-look 4364 ? S 0:00 /usr/sbin/automount --timeout=300 --use-old-ldap-look 4455 ? S 0:00 /usr/sbin/automount --timeout=300 --use-old-ldap-look 4469 ? Ss 0:00 /usr/sbin/acpid 4481 ? Ss 0:00 cupsd 4535 ? Ss 0:00 /usr/sbin/sshd 4550 ? Ss 0:00 xinetd -stayalive -pidfile /var/run/xinetd.pid 4568 ? SLs 0:00 ntpd -u ntp:ntp -p /var/run/ntpd.pid 4587 ? Ss 0:00 sendmail: accepting connections 4597 ? Ss 0:00 sendmail: Queue runner@01:00:00 for /var/spool/client 4608 ? Ss 0:00 gpm -m /dev/input/mice -t imps2 4618 ? Ss 0:00 crond 4642 ? Ss 0:00 xfs -droppriv -daemon 4661 ? Ss 0:00 /usr/sbin/atd 4691 ? Ssl 0:00 dbus-daemon-1 --system 4703 ? Ss 0:00 rhnsd --interval 240 4730 ? Ss 0:00 cfexecd 4740 ? Ss 0:00 cups-config-daemon 4751 ? Ss 0:16 hald 4826 tty1 Ss+ 0:00 /sbin/mingetty tty1 4829 tty2 Ss+ 0:00 /sbin/mingetty tty2 4863 tty3 Ss+ 0:00 /sbin/mingetty tty3 4931 tty4 Ss+ 0:00 /sbin/mingetty tty4 4966 tty5 Ss+ 0:00 /sbin/mingetty tty5 4968 tty6 Ss+ 0:00 /sbin/mingetty tty6 4970 ? Ss 0:00 /usr/bin/gdm-binary -nodaemon 5514 ? S 0:00 \_ /usr/bin/gdm-binary -nodaemon 5522 ? SL 2:33 \_ /usr/X11R6/bin/X :0 -audit 0 -dpi 96 -auth /v 5821 ? Ss 0:00 \_ -/bin/tcsh -c /usr/bin/ssh-agent /usr/bin/dbu 5868 ? S 0:00 \_ /usr/bin/gnome-session 5869 ? Ss 0:00 \_ /usr/bin/ssh-agent /usr/bin/dbus-laun 5820 ? S 0:00 [rpciod] 5872 ? S 0:00 /usr/bin/dbus-launch --exit-with-session /h/eol/gordo 5873 ? Ssl 0:00 dbus-daemon-1 --fork --print-pid 8 --print-address 6 5876 ? S 0:01 /usr/libexec/gconfd-2 5 5879 ? S 0:00 /usr/bin/gnome-keyring-daemon 5881 ? Ss 0:00 /usr/libexec/bonobo-activation-server --ac-activate - 5883 ? S 0:00 /usr/libexec/gnome-settings-daemon --oaf-activate-iid 5889 ? R 39:18 /usr/libexec/gam_server 5898 ? S 0:00 xscreensaver -nosplash 5926 ? Ss 0:08 /usr/bin/metacity --sm-client-id=default1 5930 ? Ss 0:02 gnome-panel --sm-client-id default2 5932 ? Ssl 0:01 nautilus --no-default-window --sm-client-id default3 5934 ? Ss 0:00 gnome-volume-manager --sm-client-id default6 5938 ? Ss 0:00 eggcups --sm-client-id default5 5940 ? Ss 0:00 pam-panel-icon --sm-client-id default0 5943 ? S 0:00 \_ /sbin/pam_timestamp_check -d root 5942 ? SNs 0:04 /usr/bin/python /usr/bin/rhn-applet-gui --sm-client-i 5947 ? Sl 0:00 /usr/libexec/gnome-vfs-daemon --oaf-activate-iid=OAFI 5958 ? S 0:00 /usr/libexec/mapping-daemon 5960 ? S 0:09 /usr/libexec/wnck-applet --oaf-activate-iid=OAFIID:GN 5962 ? S 0:03 /usr/libexec/stickynotes_applet --oaf-activate-iid=OA 5964 ? S 0:00 /usr/libexec/gnome-netstatus-applet --oaf-activate-ii 5966 ? S 0:00 /usr/libexec/notification-area-applet --oaf-activate- 5968 ? S 0:00 /usr/libexec/mixer_applet2 --oaf-activate-iid=OAFIID: 5970 ? S 0:00 /usr/libexec/clock-applet --oaf-activate-iid=OAFIID:G 6022 ? S< 0:00 [krfcommd] 13794 ? S 0:00 /bin/sh /usr/lib/thunderbird-1.5.0.10/thunderbird 13818 ? S 0:00 \_ /bin/sh /usr/lib/thunderbird-1.5.0.10/run-mozilla13823 ? Sl 0:20 \_ /usr/lib/thunderbird-1.5.0.10/thunderbird-bin13835 ? Ss 0:00 /usr/bin/esd -terminate -nobeeps -as 2 -spawnfd 38 13846 ? S 0:00 /bin/sh /usr/lib/firefox-1.5.0.10/firefox 13858 ? S 0:00 \_ /bin/sh /usr/lib/firefox-1.5.0.10/run-mozilla.sh 13863 ? Sl 18:07 \_ /usr/lib/firefox-1.5.0.10/firefox-bin 14260 ? S 0:07 \_ /usr/lib/acroread/Reader/intellinux/bin/a14385 ? Sl 0:03 gnome-terminal 14386 ? S 0:00 \_ gnome-pty-helper 14387 pts/1 Ss 0:00 \_ -csh 15678 pts/1 R+ 0:00 \_ ps fax 15547 ? Ss 0:00 cfservd 15562 ? Ss 0:00 cfenvd
We are seeing this behavior as well, on two systems both of which were recently upgraded to U5. We haven't yet been able to determine any reason that triggers the behavior, but we're seeing it on two different systems which are not that similar. One is a Dell 2950 running RHEL4 ES 64-bit, the other is a virtual machine running RHEL4 ES 32-bit on VMware ESX server on an IBM HS20 blade. Both of these systems regularly support GUI logins for remote users (via VNC or XDMCP) and it's the gam_server spawned by the gnome session which hangs. Running an strace on these hun gam_server processes gives us no output, whereas a normally functioning gam_server shows "poll" system calls at regular intervals. We can kill these hung gam_server processes, but they always return within a few minutes to a few hours.
I have managed to alleviate the problem with a workaround [1]. My /etc/gaminrc contains: --- # configuration for gamin # Can be used to override the default behaviour. # notify filepath(s) : indicate to use kernel notification # poll filepath(s) : indicate to use polling instead # fsset fsname method poll_limit : indicate what method of notification for the filesystem # kernel - use the kernel for notification # poll - use polling for notification # none - don't use any notification # # the poll_limit is the number of seconds # that must pass before a resource is polled again. # It is optional, and if it is not present the previous # value will be used or the default. fsset ext3 poll 10 # use polling on nfs mounts and poll once every 10 seconds fsset nfs poll 10 --- Gordon [1] http://ubuntuforums.org/showthread.php?t=210329&highlight=gam_server
So, your workaround is to switch from using the kernel notification mechanism to using polling. Not ideal, but at least it works for you atm. It would be interesting to get a backtrace of a gamin using 100% cpu with gamin-debuginfo installed. Also, did it work with the previous (0.1.7-1.1.EL4) version, or what did you upgrade from? Are all of you using NFS a lot?
I'll try to get you the backtrace as soon as possible. I didn't have any problems that I noticed with the previous version (0.1.7-1.1.EL4). My system was updated by up2date when 4.5 was released. I do have nfs mounts: > df -h Filesystem Size Used Avail Use% Mounted on /dev/mapper/VolGroup00-LogVol00 64G 29G 32G 48% / /dev/sda3 99M 26M 68M 28% /boot none 505M 0 505M 0% /dev/shm xxx.xxx.xxx.xxx:/export/xxx 826G 276G 542G 34% /h/xxx xxx.xxx.xxx.xxx:/export/system/csoft_lnx 230G 46G 184G 20% /net/csoft_lnx xxx.xxx.xxx.xxx:/export/system/opt_lnx 230G 46G 184G 20% /net/opt_lnx
I'll try to reproduce this here. However, anyone that gets this in a production environment i recommend backing down to 0.1.7-1.1.EL4 for now (unless you want to help debug this). The fix in -1.2 was to help configuring poll times, so it is not critical to most people.
I wasn't able to reproduce this in some stress-testing here, using both NFS and ext3 filesystems. Here is some more information that would be nice to have from a system where this is happening: 1) A strace of the running gam_server 2) A debug log from gamin: send SIGUSR2 to gam_server, wait a while, then send it again, and the debug log should be in /tmp/gamin_debug_* 3) Kill nautilus and see if this helps Also, the workaround /etc/gaminrc is does two things a) it turns on polling on ext3 filesystems (fsset ext3 poll 10) b) it increases the poll rate from the default 5 secs to 10 secs (fsset nfs poll 10) It would be interesting to know which of these two changes work around the problem, or if both are actually needed.
We actually ran an strace on several of our hung gam_server processes last week and the output was absolutely nothing. It looked like a completely stuck/hung process, not output from strace at all. I'll try the debug log if/when it happens again. Interestingly, we had it happen multiple times last week on multiple systems, but so far this week not a trace. All of the systems affected did have moderately large (~500GB) ext3 filesystems with ~500,000 files. One had no NFS at all, the others mounted the large filesystem via NFS. Killing nautilus did nothing for the hung process. As part of our troubleshooting on the issue, we actually forced all users to logout and reverted to init level 3 and verified that all other GUI components unloaded. The hung gam_server processes stuck around until will killed them.
OK, got another hung one today. This is on a system with no NFS at all, just ext3 filesystems on LVM volumes. Running strace gives me this: # strace -p 18573 Process 18573 attached - interrupt to quit That's it, no other output at all. Sending SIGUSR2 to attempt to produce a debug log also produces no output. I can produce debug output on a "normal" gam_server process. It seems the process is simply spinning doing absolutely nothing. What else can I provide?
Don't know if this will help but here is a backtrace of my currently hung gam_server process: #0 0x0805398c in node_remove_directory_subscription (node=0x82ac958, sub=0x82ba670) at gam_poll_dnotify.c:423 #1 0x08053ee9 in gam_poll_dnotify_remove_subscription_real (sub=0x82ba670) at gam_poll_dnotify.c:565 #2 0x0805400a in gam_poll_dnotify_remove_subscription (sub=0x82ba670) at gam_poll_dnotify.c:603 #3 0x0804b4e2 in gam_poll_remove_subscription (sub=0x82ba670) at gam_server.c:498 #4 0x08052d1e in gam_dnotify_remove_subscription (sub=0x82ba670) at gam_dnotify.c:430 #5 0x0804b40a in gam_kernel_remove_subscription (sub=0x82ba670) at gam_server.c:457 #6 0x0804b160 in gam_remove_subscription (sub=0x82ba670) at gam_server.c:251 #7 0x0804a9b4 in gam_listener_free_subscription (listener=0x829af38, sub=0x82ba670) at gam_listener.c:99 #8 0x0804aa2b in gam_listener_free (listener=0x829af38) at gam_listener.c:128 #9 0x0804f6ea in gam_connection_close (conn=0x829b220) at gam_connection.c:119 #10 0x0804f0c5 in gam_client_conn_shutdown (source=0x8297aa0, conn=0x829b220) at gam_channel.c:686 #11 0x0804ecdb in gam_client_conn_read (source=0x8297aa0, condition=17, info=0x829b220) at gam_channel.c:270 #12 0x00b68907 in g_vasprintf () from /usr/lib/libglib-2.0.so.0 #13 0x00b4474b in g_main_context_dispatch () from /usr/lib/libglib-2.0.so.0 #14 0x00b461d2 in g_main_context_acquire () from /usr/lib/libglib-2.0.so.0 #15 0x00b4647f in g_main_loop_run () from /usr/lib/libglib-2.0.so.0 #16 0x0804b8e3 in main (argc=1, argv=0xbff64af4) at gam_server.c:647 #17 0x00669de3 in __libc_start_main () from /lib/tls/libc.so.6 #18 0x0804a355 in _start ()
Using gdb I told the node_remove_directory_subscription to return to it's caller and once I did this my stuck gam_server process began running again. An strace of the gam_server process looks "normal" (alternate poll and gettimeofday) system calls, and the process eventually exited normally (it's session had already ended). Hopefully this provides a hint about the problem.
So, it seems that it got stuck in the "for (l = children; l; l = l->next)" loop in node_remove_directory_subscription. However, that should only happen if that list was circular, but its a local copy of a list that was copied in gam_tree_get_children(), and since that copy succeeded it should not be circular. Can someone verify that we're stuck in that particular loop by taking a bunch of backtraces. I.E. connect to gdb, get a backtrace, run "c", wait a while, press ctrl-c, get a new backtrace and loop. If it turns out that all the backtraces are in node_remove_directory_subscription it would be nice to see exactly what its doing. Just run "n" a lot of time and post the log of that.
(In reply to comment #5) > I'll try to reproduce this here. However, anyone that gets this in a production > environment i recommend backing down to 0.1.7-1.1.EL4 for now (unless you want > to help debug this). The fix in -1.2 was to help configuring poll times, so it > is not critical to most people. I'm not sure that most of us normal users have access to 0.1.7-1.1.EL4 because, as far as I know, it was never part of an official release. It appears that prior to RHEL4 U5 the previous version was 0.1.1-4.EL4 and I'm sure most of us were running that version before the update to U5. Even if we had access to 0.1.7-1.1.EL4 it seems far more likely the bug was introduced in the move from 0.1.1 to 0.1.7 than in the small changes to polling (I'll admit to saying this wihtout having even looked at the changes set between 0.1.1 and 0.1.7 so I may be speaking foolishly here). I'll try to get the gdb information you asked for on my next hang.
Oh, sorry, i didn't know that 0.1.7-1.1 wasn't out. Yes, seems more likely the bug was introduced in that.
We have two systems that we recently updated to RHEL4 U5 that is seeing this same problem. There is a core file that gets created in the users top level directory prior to gam_server going into a loop. We're trying the /etc/gaminrc workaround suggested above on one of the systems to see if it'll help, but we'll leave the other one unmodified and try strace/gdb backtrace to see if we can get any useful output. I can provide a modified "strings" listing from the core files if that would be helpful.
FWIW, on RedHat 4 update 5 the workaround file mentioned above seems to cure the problem if located in /etc/gamin/gaminrc only. Cf. strings /usr/libexec/gam_server | grep gaminrc
I'll also report that the latest gamin from gnome (0.1.8) does not solve the issue either. Anecdotally, we had been running a similar version of gamin on another server without issue before RHEL4u5. Could this be a combination of kernel change and gamin version update?
a similar bug bugzilla #132354 from upstream .
I wonder what is common between these systems exhibiting this bug. Do you have large filesystems?
Our systems that have seen this bug have been regular desktops with no unusual hardware or very large local filesystems. They do mount a Netapp via NFS which has some filesystems of up to maybe 500 GB, however that was never an issue previously. They typically are single user systems that mount the Netapp and the primary function is software development (nothing too intense like Oracle). These systems are also running AFS. I'm not sure if many other people are running that, but it's worth mentioning.
The /etc/gamin/garminrc workaround previously suggested does not seem to work in our environment as we had another runaway gam_server on a system with this in place. A strace while the process is consuming 100% CPU produces no information (whereas when it is running normally it is quite chatty). Our environment does employs roughly 5 static mounts and occasional automounts with the largest remote filesystem being only 1Tb in size. Local disks (2) are only 160gb. Lots of segfaults of gam_server being recorded in the messages file (see below): May 23 15:25:42 abc kernel: gam_server[11549]: segfault at 000000000000008f rip 000000000040c61f rsp 0000007fbfffdd20 error 4 May 24 08:23:42 abc kernel: gam_server[24522]: segfault at 0000000000000060 rip 000000000040c61f rsp 0000007fbfffdd30 error 4 May 24 13:23:57 abc kernel: gam_server[25665]: segfault at 000000000000002b rip 0000003cc07234f4 rsp 0000007fbfffddc0 error 4 May 24 17:58:06 abc kernel: gam_server[12894]: segfault at 0000000000001d9b rip 0000003cc0723d56 rsp 0000007fbffff1b0 error 4 May 24 17:59:42 abc kernel: gam_server[15965]: segfault at 000000000000008f rip 000000000040c61f rsp 0000007fbfffda70 error 4 May 24 18:48:42 abc kernel: gam_server[15997]: segfault at 0000000000000049 rip 000000000040c61f rsp 0000007fbfffdd20 error 4 May 24 19:53:42 abc kernel: gam_server[16692]: segfault at 0000000000000083 rip 000000000040c61f rsp 0000007fbfffda70 error 4 May 24 20:35:42 abc kernel: gam_server[17255]: segfault at 000000000000003b rip 000000000040c61f rsp 0000007fbfffdd30 error 4 May 25 01:06:42 abc kernel: gam_server[17614]: segfault at 000000000000008e rip 000000000040c61f rsp 0000007fbfffdd30 error 4 May 25 08:57:10 abc kernel: gam_server[28073]: segfault at 000000000000001a rip 0000003cc07234f4 rsp 0000007fbfffdbb0 error 4 May 25 12:01:42 abc kernel: gam_server[28908]: segfault at 000000000000008e rip 000000000040c61f rsp 0000007fbfffdd20 error 4 May 25 13:10:42 abc kernel: gam_server[2178]: segfault at 000000000000008c rip 000000000040c61f rsp 0000007fbfffdd30 error 4 May 25 14:06:42 abc kernel: gam_server[2777]: segfault at 000000000000003f rip 000000000040c61f rsp 0000007fbfffdd20 error 4 May 25 14:52:42 abc kernel: gam_server[3991]: segfault at 000000000000007e rip 000000000040c61f rsp 0000007fbfffda70 error 4 May 25 15:56:42 abc kernel: gam_server[4513]: segfault at 0000000000000058 rip 000000000040c61f rsp 0000007fbfffdd30 error 4 Jun 4 12:04:47 abc kernel: gam_server[13024]: segfault at 0000000000000059 rip 000000000040c61f rsp 0000007fbfffda60 error 4 Jun 4 15:46:47 abc kernel: gam_server[15938]: segfault at 0000000000000087 rip 000000000040c61f rsp 0000007fbfffdd20 error 4 Jun 4 16:47:47 abc kernel: gam_server[18364]: segfault at 000000000000008b rip 000000000040c61f rsp 0000007fbfffdd20 error 4 Jun 5 13:41:07 abc kernel: gam_server[29176]: segfault at 0000000000000046 rip 000000000040c61f rsp 0000007fbfffda60 error 4 Jun 5 15:13:07 abc kernel: gam_server[751]: segfault at 0000000000000041 rip 000000000040c61f rsp 0000007fbfffdd20 error 4 Jun 5 17:09:07 abc kernel: gam_server[1937]: segfault at 0000000000000050 rip 000000000040c61f rsp 0000007fbfffdd30 error 4 Jun 7 16:29:01 abc kernel: gam_server[27451]: segfault at 0000000000000027 rip 0000003cc07234f4 rsp 0000007fbfffd7b0 error 4
Created attachment 156740 [details] Snipit of strace output monitoring runaway gam_server
Just had another runaway gam_server on the same system. This time I was able to get some strace info (see previous attachment). As you can see it seems to be looping in the user's .kde/share/config subdirectory, rather than checking other directories when the application is running normally. Only files in that subdirectory that changed today are: kdeglobals, korgacrc, kmailrc, uiserverrc, korganizerrc, emaildefaults. No segfault this time nor core file.
(In reply to comment #4) > I do have nfs mounts: > > > df -h > Filesystem Size Used Avail Use% Mounted on > /dev/mapper/VolGroup00-LogVol00 > 64G 29G 32G 48% / > /dev/sda3 99M 26M 68M 28% /boot > none 505M 0 505M 0% /dev/shm > xxx.xxx.xxx.xxx:/export/xxx > 826G 276G 542G 34% /h/xxx > xxx.xxx.xxx.xxx:/export/system/csoft_lnx > 230G 46G 184G 20% /net/csoft_lnx > xxx.xxx.xxx.xxx:/export/system/opt_lnx > 230G 46G 184G 20% /net/opt_lnx The first NFS filesystem (/export/xxx) is ext3, and the second (/export/system/csoft_lnx) and third (/export/system/opt_lnx) NFS filesystems are gfs. Gordon
Created attachment 156769 [details] Debug output from kill -SIGUSR2 of runaway gam_server Just had another 'event'; same machine, same user. Attached is output from a kill -SIGUSR2 on the runaway gam_server. Hopefully this will have some useful information. Any suggestions/additional workarounds would be appreciated (a fix would be even better), as it is getting old having to keep killing the gam_server (we may end up nuking it if this keeps up). Thanks!
The posted workaround does not work for me either. The system we have that is experiencing this issue the most is has no NFS mounts but is a NFS server and exports a 300GB ext3 filesystem. Not huge certainly but not little either. I have attached the requested gdb output. Thanks, Tom
Created attachment 156776 [details] GDB output from stuck gam_server
> Any suggestions/additional workarounds would be appreciated (a fix would be > even better), as it is getting old having to keep killing the gam_server (we > may end up nuking it if this keeps up). Have you tried downgrading to the RHEL4 U4 version? We don't seem to have any problems when we run that version, although maybe it's just good luck. We've temporarily blacklisted it on the systems where we're having the problems although I had to put it back on one of the system to provide information to Redhat support.
There are two problems here. One is tom getting a real 100% cpu use loop, i.e. running strace on it shows nothing. The other problem is gamin doing lots of polling. This has been observed before, and seems to be related to something kde does (only happens with kde). I believe this is due to gamin doing polling to often on NFS, and can be fixed by configuring gamin to poll less often. In fact, the reason for this update was to add a way to globally set this option so that such problems could be fixed.
Looking at the last few backtraces from tom it really looks like its stuck in the for loop in node_remove_directory_subscription(). However, I still can't figure out how that could ever happen...
I would agree with #28 - there may be two bugs in here, both cause 100%CPU on gam_server, one while endlessly stat()ing some file (possible related to NFS), one in the for() loop. Suggest to split this..
I created bug 244633 for the case where strace shows a lot of polling. This bug is used for the "strace shows nothing" case.
I ran an overnight stress test of gamin, but couldn't reproduce this. It might need some specific condition to trigger. Any clues gladly accepted.
tom: Any chance you could verify that the children list in node_remove_directory_subscription() is/is not circular. When you break inside node_remove_directory_subscription, do something like p children p children->next p children->next->next p children->next->next->next etc, until you find a next pointer that is null, or one that has already been printed.
This is a long-shot, bug could people test this version: http://people.redhat.com/alexl/RPMS/gamin/gamin-0.1.7-1.2.EL4.0signalsafe.src.rpm It fixes use of malloc (which is signal unsafe) in a signal handler. It doesn't seem likely that this would cause circular lists, but its possible. Testing will see.
Sorry, that had a bug, please try instead this updated version: http://people.redhat.com/alexl/RPMS/gamin/gamin-0.1.7-1.2.EL4.0signalsafe.2.src.rpm
I've been running both of the updated gamin packages that Alex posted and haven't seen any problems yet. With the previous version I would see gam_server using 100% CPU time after a couple of hours.
Has anyone else tested the package i posted?
I have downloaded both your updated package and a different one provided by Redhat support, but so far have not tested either of them. For whatever reason I haven't seen the problem in the last two weeks. I can't really explain that as there have been no changed to the systems in question and previously we were seeing the issue almost daily. Anyway, I'm still willing to try the package you've posted here, but I'm not sure how valuable a success report would be since we're suddenly not seeing the problem.
We've been testing it on 3 systems and I just installed it on a 4th this morning. The one system that was experiencing the problem the most frequently had 4 failures in one day (core dumps) prior to installing your 2nd release, but none since then. So far, none of the systems have observed the problem with your 2nd release installed. As others have observed, we have some RHEL 4.5 systems that have never experienced this problem and others that either have it frequently or only once since our upgrades, so it's hard to know what triggers the race condition. It is encouraging, however, that with signalsafe.2 we've been running event-free, which is promising.
We have seen this bug also here, as users home directories are on NFS served disks. Is this patch going to be released soon as an official upgrade, or should I install the patched version instead?
I've just had this reported to me for the first time (and we have 850+ machines running RHEL4). What's interesting is that this particular machine is an EL3- >EL4 upgrade, which is not a particularly common configuration for us (mostly we force a reinstall)
Just like to chime in here to thank you for the signalsafe.2 update. It did the trick for us as well (~90 host deployment). We were seeing the same problems. CPU being pinned and regular segfaults.
I installed the signalsafe.2 update also on an EL4 machine which could easily reproduce the runaway process and so far it's good. I'd say get it released into the RHN channel as a fix right away. Can't be any worse than the package they have in there now. Thomas Walker
This bugzilla has Keywords: Regression. Since no regressions are allowed between releases, it is also being proposed as a blocker for this release. Please resolve ASAP.
As another data point, back in late June we had another few days where we had gamin runaway processes several times a day on a few servers so we went ahead and installed the signalsafe.2 packages. We've been running these packages for the entire month of July and we haven't experienced this issue again so it certainly seems this patch is an improvement for us. Thanks, Tom
Built as gamin-0.1.7-1.3.EL4
On upgrading to signalsafe rpms, gam_server uses 100% cpu intermittently. This event sent from IssueTracker by rkhadgar issue 128235
From gamin debug logs Exclude: use POLL on /media/* Exclude: use POLL on /mnt/* Exclude: use POLL on /dev/* Exclude: use POLL on /proc/* fstype ext3 monitor kernel poll timeout 0 ext3 filesystem mounted at /featemp ... Poll: poll_file for /featemp/mcrogers/ACAS/Job-1.cid called ... From gamin strace 6812 09:28:24 stat("/featemp/mcrogers/ACAS/Job-1.inp", 0x7fbfffdfc0) = -1 ENOENT (No such file or directory) ... It seems gamin was probably stuck in a loop trying to stat for non-existant file. This event sent from IssueTracker by rkhadgar issue 128235 it_file 98991
I compiled and installed: http://people.redhat.com/alexl/RPMS/gamin/gamin-0.1.7- 1.2.EL4.0signalsafe.2.src.rpm However, while this seems to have helped, I still ran into a looping gam_server. The backtrace for it always showed: (gdb) backtrace #0 0x00000032146b94a5 in _xstat () from /lib64/tls/libc.so.6 #1 0x0000000000410276 in gam_poll_dnotify_poll_file (node=0x5768d0) at gam_poll_dnotify.c:206 #2 0x00000000004049a5 in gam_poll_file (node=0x5768d0) at gam_server.c:505 #3 0x000000000040759b in gam_poll_generic_scan_directory_internal (dir_node=0x52fc50) at gam_poll_generic.c:349 #4 0x00000000004076e3 in gam_poll_generic_scan_directory (path=0x53db60 "/etc") at gam_poll_generic.c:399 #5 0x000000000040fa7d in gam_dnotify_pipe_handler (user_data=0x51b330) at gam_dnotify.c:315 #6 0x0000003216b266bd in g_main_context_dispatch () from /usr/lib64/libglib- 2.0.so.0 #7 0x0000003216b28397 in g_main_context_acquire () from /usr/lib64/libglib- 2.0.so.0 #8 0x0000003216b28735 in g_main_loop_run () from /usr/lib64/libglib-2.0.so.0 #9 0x0000000000404c37 in main (argc=1, argv=0x7fbfffb498) at gam_server.c:584
Sorry, I just figured out the cause of my previous report. There was a samba mount point in a hung state, which was also on the Desktop. If someone tried to access that mount point, i.e. by doing a df command, that would cause gam_server to start looping.
Hi All I have been running gamin-0.1.7-1.2.EL4.0signalsafe.2 for over a month, and have not seen this problem occur again. Gordon
*** Bug 244694 has been marked as a duplicate of this bug. ***
Hi all, I've been running this patch on several workstations now and it looks like the problem has been solved. Any idea when the official patch will be released ? Cheers, Stef
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2007-0772.html