Bug 240154

Summary: gam_server using 100% of CPU (strace shows nothing)

Product: Red Hat Enterprise Linux 4
Component: gamin
Version: 4.5
Hardware: i686
OS: Linux
Status: CLOSED ERRATA
Severity: urgent
Priority: medium
Keywords: Regression
Reporter: Gordon Farquharson <gordonf>
Assignee: Alexander Larsson <alexl>
CC: brett.morrow, Colin.Simpson, dimitrios.gerasimatos, gunther.mayer, hyclak, jan.iven, kajtzu, linux_support, mrl, nilsson, nobody+bjmason, owen_steinert, pasteur, paul.boin, sam.sharpe, sa, srigler, stefan.hoelldampf, stephane_jourdan, tao, tatsu-ab1, tsc5yc, ttsig, voetelink
Fixed In Version: RHBA-2007-0772
Doc Type: Bug Fix
Bug Blocks: 244633
Last Closed: 2007-11-15 16:04:36 UTC

Attachments:
  Snippet of strace output monitoring runaway gam_server
  Debug output from kill -SIGUSR2 of runaway gam_server
  GDB output from stuck gam_server

Description Gordon Farquharson 2007-05-15 15:11:47 UTC
Description of problem:

gam_server uses 100% of CPU:

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 5889 gordonf   25   0  3248 1208  960 R  100  0.1  37:02.33 gam_server

Version-Release number of selected component (if applicable):

 > rpm -qa | grep gamin
gamin-0.1.7-1.2.EL4

How reproducible:

Boot a RHEL4 Update 5 machine, and after a while, gam_server starts using 100%
of the CPU.

Additional info:

> ps fax
  PID TTY      STAT   TIME COMMAND
    1 ?        S      0:00 init [5]
    2 ?        S      0:00 [migration/0]
    3 ?        SN     0:00 [ksoftirqd/0]
    4 ?        S      0:00 [migration/1]
    5 ?        SN     0:00 [ksoftirqd/1]
    6 ?        S<     0:00 [events/0]
    8 ?        S<     0:00  \_ [khelper]
    9 ?        S<     0:00  \_ [kacpid]
   32 ?        S<     0:00  \_ [kblockd/0]
   33 ?        S<     0:00  \_ [kblockd/1]
   51 ?        S      0:00  \_ [pdflush]
   52 ?        S      0:00  \_ [pdflush]
   54 ?        S<     0:00  \_ [aio/0]
   55 ?        S<     0:00  \_ [aio/1]
  435 ?        S<     0:00  \_ [ata/0]
 3228 ?        S<     0:00  \_ [kauditd]
    7 ?        S<     0:00 [events/1]
  436 ?        S<     0:00  \_ [ata/1]
  437 ?        S<     0:00  \_ [ata_aux]
   34 ?        S      0:00 [khubd]
   53 ?        S      0:00 [kswapd0]
  199 ?        S      0:00 [kseriod]
  441 ?        S      0:00 [scsi_eh_0]
  442 ?        S      0:00 [scsi_eh_1]
  478 ?        S      0:00 [kjournald]
 1754 ?        S<s    0:00 udevd
 3292 ?        S      0:00 [kjournald]
 3712 ?        Ss     0:00 syslogd -m 0
 3716 ?        Ss     0:00 klogd -x
 3730 ?        Ss     0:00 irqbalance
 3742 ?        Ss     0:00 portmap
 3762 ?        Ss     0:00 rpc.statd
 3797 ?        Ss     0:00 rpc.idmapd
 3908 ?        Sl     0:00 ypbind
 4074 ?        S      0:00 /usr/sbin/automount --timeout=300 --use-old-ldap-look
4181 ?        S      0:00 /usr/sbin/automount --timeout=300 --use-old-ldap-look
4209 ?        S      0:00 /usr/sbin/automount --timeout=300 --use-old-ldap-look
4276 ?        S      0:00 /usr/sbin/automount --timeout=300 --use-old-ldap-look
4364 ?        S      0:00 /usr/sbin/automount --timeout=300 --use-old-ldap-look
4455 ?        S      0:00 /usr/sbin/automount --timeout=300 --use-old-ldap-look
4469 ?        Ss     0:00 /usr/sbin/acpid
 4481 ?        Ss     0:00 cupsd
 4535 ?        Ss     0:00 /usr/sbin/sshd
 4550 ?        Ss     0:00 xinetd -stayalive -pidfile /var/run/xinetd.pid
 4568 ?        SLs    0:00 ntpd -u ntp:ntp -p /var/run/ntpd.pid
 4587 ?        Ss     0:00 sendmail: accepting connections
 4597 ?        Ss     0:00 sendmail: Queue runner@01:00:00 for /var/spool/client
4608 ?        Ss     0:00 gpm -m /dev/input/mice -t imps2
 4618 ?        Ss     0:00 crond
 4642 ?        Ss     0:00 xfs -droppriv -daemon
 4661 ?        Ss     0:00 /usr/sbin/atd
 4691 ?        Ssl    0:00 dbus-daemon-1 --system
 4703 ?        Ss     0:00 rhnsd --interval 240
 4730 ?        Ss     0:00 cfexecd
 4740 ?        Ss     0:00 cups-config-daemon
 4751 ?        Ss     0:16 hald
 4826 tty1     Ss+    0:00 /sbin/mingetty tty1
 4829 tty2     Ss+    0:00 /sbin/mingetty tty2
 4863 tty3     Ss+    0:00 /sbin/mingetty tty3
 4931 tty4     Ss+    0:00 /sbin/mingetty tty4
 4966 tty5     Ss+    0:00 /sbin/mingetty tty5
 4968 tty6     Ss+    0:00 /sbin/mingetty tty6
 4970 ?        Ss     0:00 /usr/bin/gdm-binary -nodaemon
 5514 ?        S      0:00  \_ /usr/bin/gdm-binary -nodaemon
 5522 ?        SL     2:33      \_ /usr/X11R6/bin/X :0 -audit 0 -dpi 96 -auth /v
5821 ?        Ss     0:00      \_ -/bin/tcsh -c /usr/bin/ssh-agent /usr/bin/dbu
5868 ?        S      0:00          \_ /usr/bin/gnome-session
 5869 ?        Ss     0:00              \_ /usr/bin/ssh-agent /usr/bin/dbus-laun
5820 ?        S      0:00 [rpciod]
 5872 ?        S      0:00 /usr/bin/dbus-launch --exit-with-session /h/eol/gordo
5873 ?        Ssl    0:00 dbus-daemon-1 --fork --print-pid 8 --print-address 6
 5876 ?        S      0:01 /usr/libexec/gconfd-2 5
 5879 ?        S      0:00 /usr/bin/gnome-keyring-daemon
 5881 ?        Ss     0:00 /usr/libexec/bonobo-activation-server --ac-activate -
5883 ?        S      0:00 /usr/libexec/gnome-settings-daemon --oaf-activate-iid
5889 ?        R     39:18 /usr/libexec/gam_server
 5898 ?        S      0:00 xscreensaver -nosplash
 5926 ?        Ss     0:08 /usr/bin/metacity --sm-client-id=default1
 5930 ?        Ss     0:02 gnome-panel --sm-client-id default2
 5932 ?        Ssl    0:01 nautilus --no-default-window --sm-client-id default3
 5934 ?        Ss     0:00 gnome-volume-manager --sm-client-id default6
 5938 ?        Ss     0:00 eggcups --sm-client-id default5
 5940 ?        Ss     0:00 pam-panel-icon --sm-client-id default0
 5943 ?        S      0:00  \_ /sbin/pam_timestamp_check -d root
 5942 ?        SNs    0:04 /usr/bin/python /usr/bin/rhn-applet-gui --sm-client-i
5947 ?        Sl     0:00 /usr/libexec/gnome-vfs-daemon --oaf-activate-iid=OAFI
5958 ?        S      0:00 /usr/libexec/mapping-daemon
 5960 ?        S      0:09 /usr/libexec/wnck-applet --oaf-activate-iid=OAFIID:GN
5962 ?        S      0:03 /usr/libexec/stickynotes_applet --oaf-activate-iid=OA
5964 ?        S      0:00 /usr/libexec/gnome-netstatus-applet --oaf-activate-ii
5966 ?        S      0:00 /usr/libexec/notification-area-applet --oaf-activate-
5968 ?        S      0:00 /usr/libexec/mixer_applet2 --oaf-activate-iid=OAFIID:
5970 ?        S      0:00 /usr/libexec/clock-applet --oaf-activate-iid=OAFIID:G
6022 ?        S<     0:00 [krfcommd]
13794 ?        S      0:00 /bin/sh /usr/lib/thunderbird-1.5.0.10/thunderbird
13818 ?        S      0:00  \_ /bin/sh /usr/lib/thunderbird-1.5.0.10/run-mozilla
13823 ?        Sl     0:20      \_ /usr/lib/thunderbird-1.5.0.10/thunderbird-bin
13835 ?        Ss     0:00 /usr/bin/esd -terminate -nobeeps -as 2 -spawnfd 38
13846 ?        S      0:00 /bin/sh /usr/lib/firefox-1.5.0.10/firefox
13858 ?        S      0:00  \_ /bin/sh /usr/lib/firefox-1.5.0.10/run-mozilla.sh
13863 ?        Sl    18:07      \_ /usr/lib/firefox-1.5.0.10/firefox-bin
14260 ?        S      0:07          \_ /usr/lib/acroread/Reader/intellinux/bin/a
14385 ?        Sl     0:03 gnome-terminal
14386 ?        S      0:00  \_ gnome-pty-helper
14387 pts/1    Ss     0:00  \_ -csh
15678 pts/1    R+     0:00      \_ ps fax
15547 ?        Ss     0:00 cfservd
15562 ?        Ss     0:00 cfenvd

Comment 1 Tom Sightler 2007-05-16 17:29:14 UTC
We are seeing this behavior as well, on two systems, both of which were
recently upgraded to U5.  We haven't yet been able to determine what triggers
the behavior, but we're seeing it on two systems which are not that similar.
One is a Dell 2950 running RHEL4 ES 64-bit; the other is a virtual machine
running RHEL4 ES 32-bit on VMware ESX Server on an IBM HS20 blade.

Both of these systems regularly support GUI logins for remote users (via VNC or
XDMCP), and it's the gam_server spawned by the GNOME session which hangs.
Running strace on these hung gam_server processes gives us no output, whereas a
normally functioning gam_server shows "poll" system calls at regular intervals.

We can kill these hung gam_server processes, but they always return within a few
minutes to a few hours.


Comment 2 Gordon Farquharson 2007-05-16 18:19:29 UTC
I have managed to alleviate the problem with a workaround [1]. My /etc/gaminrc
contains:

---

# configuration for gamin
# Can be used to override the default behaviour.
# notify filepath(s) : indicate to use kernel notification
# poll filepath(s)   : indicate to use polling instead
# fsset fsname method poll_limit : indicate what method of notification to use
#                                  for the filesystem
#                                  kernel - use the kernel for notification
#                                  poll - use polling for notification
#                                  none - don't use any notification
#
#                                  the poll_limit is the number of seconds that
#                                  must pass before a resource is polled again.
#                                  It is optional, and if it is not present the
#                                  previous value will be used or the default.

fsset ext3 poll 10                 # use polling on ext3 mounts and poll once every 10 seconds
fsset nfs poll 10

---

Gordon

[1] http://ubuntuforums.org/showthread.php?t=210329&highlight=gam_server


Comment 3 Alexander Larsson 2007-05-21 08:43:00 UTC
So, your workaround is to switch from using the kernel notification mechanism
to using polling. Not ideal, but at least it works for you at the moment.

It would be interesting to get a backtrace of a gam_server using 100% CPU with
gamin-debuginfo installed.

Also, did it work with the previous (0.1.7-1.1.EL4) version, or what did you
upgrade from?

Are all of you using NFS a lot?

Comment 4 Gordon Farquharson 2007-05-21 16:32:33 UTC
I'll try to get you the backtrace as soon as possible.

I didn't have any problems that I noticed with the previous version
(0.1.7-1.1.EL4). My system was updated by up2date when 4.5 was released.

I do have nfs mounts:

> df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/VolGroup00-LogVol00
                       64G   29G   32G  48% /
/dev/sda3              99M   26M   68M  28% /boot
none                  505M     0  505M   0% /dev/shm
xxx.xxx.xxx.xxx:/export/xxx
                      826G  276G  542G  34% /h/xxx
xxx.xxx.xxx.xxx:/export/system/csoft_lnx
                      230G   46G  184G  20% /net/csoft_lnx
xxx.xxx.xxx.xxx:/export/system/opt_lnx
                      230G   46G  184G  20% /net/opt_lnx


Comment 5 Alexander Larsson 2007-05-22 06:47:24 UTC
I'll try to reproduce this here. However, for anyone hitting this in a
production environment, I recommend backing down to 0.1.7-1.1.EL4 for now
(unless you want to help debug this). The fix in -1.2 only made poll times
easier to configure, so it is not critical for most people.

Comment 6 Alexander Larsson 2007-05-22 09:36:46 UTC
I wasn't able to reproduce this in some stress-testing here, using both NFS and
ext3 filesystems. 

Here is some more information that would be nice to have from a system where
this is happening:

1) A strace of the running gam_server
2) A debug log from gamin:
   send SIGUSR2 to gam_server, wait a while, then send it again, and the debug log
   should be in /tmp/gamin_debug_*
3) Kill nautilus and see if this helps

Also, the workaround /etc/gaminrc does two things:
a) it turns on polling on ext3 filesystems (fsset ext3 poll 10)
b) it increases the poll interval from the default 5 secs to 10 secs (fsset nfs poll 10)

It would be interesting to know which of these two changes works around the
problem, or if both are actually needed.
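
One way to isolate that, sketched from the fsset syntax documented in comment
2 (poll_limit is optional, so each change can be tested on its own). These are
untested example configs, not something anyone in this bug has verified:

  # Variant A: polling on ext3 at the default interval (tests change (a) alone)
  fsset ext3 poll

  # Variant B: kernel notification kept on ext3, slower nfs polling (tests (b) alone)
  fsset nfs poll 10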


Comment 7 Tom Sightler 2007-05-22 14:18:22 UTC
We actually ran an strace on several of our hung gam_server processes last week
and got absolutely nothing.  They looked like completely stuck/hung processes,
with no output from strace at all.

I'll try the debug log if/when it happens again.  Interestingly, we had it
happen multiple times last week on multiple systems, but so far this week not a
trace.  All of the systems affected did have moderately large (~500GB) ext3
filesystems with ~500,000 files.  One had no NFS at all; the others mounted the
large filesystem via NFS.

Killing nautilus did nothing for the hung process.  As part of our
troubleshooting on the issue, we actually forced all users to log out,
reverted to init level 3, and verified that all other GUI components unloaded.
The hung gam_server processes stuck around until we killed them.


Comment 8 Tom Sightler 2007-05-22 19:21:52 UTC
OK, got another hung one today.  This is on a system with no NFS at all, just
ext3 filesystems on LVM volumes.  Running strace gives me this:

  # strace -p 18573                  
  Process 18573 attached - interrupt to quit


That's it, no other output at all.  Sending SIGUSR2 to attempt to produce a
debug log also produces no output.  I can produce debug output on a "normal"
gam_server process.  It seems the process is simply spinning doing absolutely
nothing.

What else can I provide?




Comment 9 Tom Sightler 2007-05-22 19:23:17 UTC
I don't know if this will help, but here is a backtrace of my currently hung
gam_server process:

#0  0x0805398c in node_remove_directory_subscription (node=0x82ac958, sub=0x82ba670) at gam_poll_dnotify.c:423
#1  0x08053ee9 in gam_poll_dnotify_remove_subscription_real (sub=0x82ba670) at gam_poll_dnotify.c:565
#2  0x0805400a in gam_poll_dnotify_remove_subscription (sub=0x82ba670) at gam_poll_dnotify.c:603
#3  0x0804b4e2 in gam_poll_remove_subscription (sub=0x82ba670) at gam_server.c:498
#4  0x08052d1e in gam_dnotify_remove_subscription (sub=0x82ba670) at gam_dnotify.c:430
#5  0x0804b40a in gam_kernel_remove_subscription (sub=0x82ba670) at gam_server.c:457
#6  0x0804b160 in gam_remove_subscription (sub=0x82ba670) at gam_server.c:251
#7  0x0804a9b4 in gam_listener_free_subscription (listener=0x829af38, sub=0x82ba670) at gam_listener.c:99
#8  0x0804aa2b in gam_listener_free (listener=0x829af38) at gam_listener.c:128
#9  0x0804f6ea in gam_connection_close (conn=0x829b220) at gam_connection.c:119
#10 0x0804f0c5 in gam_client_conn_shutdown (source=0x8297aa0, conn=0x829b220) at gam_channel.c:686
#11 0x0804ecdb in gam_client_conn_read (source=0x8297aa0, condition=17, info=0x829b220) at gam_channel.c:270
#12 0x00b68907 in g_vasprintf () from /usr/lib/libglib-2.0.so.0
#13 0x00b4474b in g_main_context_dispatch () from /usr/lib/libglib-2.0.so.0
#14 0x00b461d2 in g_main_context_acquire () from /usr/lib/libglib-2.0.so.0
#15 0x00b4647f in g_main_loop_run () from /usr/lib/libglib-2.0.so.0
#16 0x0804b8e3 in main (argc=1, argv=0xbff64af4) at gam_server.c:647
#17 0x00669de3 in __libc_start_main () from /lib/tls/libc.so.6
#18 0x0804a355 in _start ()


Comment 10 Tom Sightler 2007-05-22 19:29:40 UTC
Using gdb, I told node_remove_directory_subscription to return to its caller,
and once I did this my stuck gam_server process began running again.  An strace
of the gam_server process looked "normal" (alternating poll and gettimeofday
system calls), and the process eventually exited normally (its session had
already ended).
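
For illustration, the gdb session for that recovery would look roughly like
this (a reconstruction of the steps described above, not a verbatim
transcript; 18573 is the stuck PID from comment 8):

  # gdb /usr/libexec/gam_server 18573
  (gdb) bt          # confirm frame #0 is node_remove_directory_subscription
  (gdb) return      # force the selected frame to return to its caller
  (gdb) continue    # let gam_server resume its main loop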

Hopefully this provides a hint about the problem.


Comment 11 Alexander Larsson 2007-05-23 10:08:37 UTC
So, it seems that it got stuck in the "for (l = children; l; l = l->next)" loop
in node_remove_directory_subscription. However, that should only happen if the
list were circular, and it's a local copy of a list that was copied in
gam_tree_get_children(); since that copy succeeded, it should not be circular.
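
For readers following along, here is a minimal C sketch of why a circular list
hangs exactly this kind of loop. The walk mirrors the loop quoted above; the
cycle-check helper is purely hypothetical (not part of gamin, just one way to
test the theory):

  #include <glib.h>

  /* The shape of the loop in node_remove_directory_subscription(): if a
   * corrupted node ever points back at an earlier one, l never becomes
   * NULL and the process spins at 100% CPU without making a system call,
   * which matches the empty strace output reported here. */
  static void walk_children(GList *children)
  {
      GList *l;
      for (l = children; l != NULL; l = l->next) {
          /* ... remove the subscription on each child ... */
      }
  }

  /* Hypothetical helper: Floyd's tortoise-and-hare cycle check. */
  static gboolean list_is_circular(GList *list)
  {
      GList *slow = list, *fast = list;
      while (fast != NULL && fast->next != NULL) {
          slow = slow->next;
          fast = fast->next->next;
          if (slow == fast)
              return TRUE;    /* the pointers met, so the list has a cycle */
      }
      return FALSE;           /* fast reached NULL, so the list terminates */
  }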

Can someone verify that we're stuck in that particular loop by taking a bunch
of backtraces? I.e. attach gdb, get a backtrace, run "c", wait a while, press
ctrl-c, get a new backtrace, and repeat.

If it turns out that all the backtraces are in
node_remove_directory_subscription, it would be nice to see exactly what it's
doing. Just run "n" a lot of times and post the log of that.

Comment 12 Tom Sightler 2007-05-23 13:14:50 UTC
(In reply to comment #5)
> I'll try to reproduce this here. However, anyone that gets this in a production
> environment i recommend backing down to 0.1.7-1.1.EL4 for now (unless you want
> to help debug this). The fix in -1.2 was to help configuring poll times, so it
> is not critical to most people.

I'm not sure that most of us normal users have access to 0.1.7-1.1.EL4 because,
as far as I know, it was never part of an official release.  It appears that
prior to RHEL4 U5 the previous version was 0.1.1-4.EL4, and I'm sure most of us
were running that version before the update to U5.

Even if we had access to 0.1.7-1.1.EL4, it seems far more likely that the bug
was introduced in the move from 0.1.1 to 0.1.7 than in the small changes to
polling (I'll admit to saying this without having even looked at the change set
between 0.1.1 and 0.1.7, so I may be speaking foolishly here).

I'll try to get the gdb information you asked for on my next hang.


Comment 13 Alexander Larsson 2007-05-23 15:35:58 UTC
Oh, sorry, I didn't know that 0.1.7-1.1 wasn't out. Yes, it seems more likely
that the bug was introduced in that move.

Comment 14 sa@tmt.ca.boeing.com 2007-05-24 17:05:52 UTC
We have two systems that we recently updated to RHEL4 U5 that are seeing this
same problem.  A core file gets created in the user's top-level directory
prior to gam_server going into a loop.  We're trying the /etc/gaminrc
workaround suggested above on one of the systems to see if it'll help, but
we'll leave the other one unmodified and try strace/gdb backtraces to see if
we can get any useful output.  I can provide a modified "strings" listing
from the core files if that would be helpful.

Comment 15 Bjorn S. Nilsson 2007-06-05 09:38:47 UTC
FWIW, on Red Hat 4 update 5 the workaround file mentioned above seems to cure
the problem only if located in /etc/gamin/gaminrc. Cf.
 strings /usr/libexec/gam_server | grep gaminrc


Comment 16 HMDC Linux Support 2007-06-05 13:49:28 UTC
I'll also report that the latest gamin from GNOME (0.1.8) does not solve the
issue either.

Anecdotally, we had been running a similar version of gamin on another server
without issue before RHEL4u5.  Could this be a combination of a kernel change
and the gamin version update?

Comment 17 ritz 2007-06-06 22:47:55 UTC
There is a similar upstream bug, bugzilla #132354.

Comment 18 Alexander Larsson 2007-06-11 07:55:17 UTC
I wonder what is common between these systems exhibiting this bug. Do you have
large filesystems?

Comment 19 Dimitrios Gerasimatos 2007-06-11 08:33:02 UTC

Our systems that have seen this bug have been regular desktops with no unusual
hardware and no very large local filesystems. They do mount a NetApp via NFS,
which has some filesystems of up to maybe 500 GB; however, that was never an
issue previously. They are typically single-user systems that mount the NetApp,
and their primary function is software development (nothing too intense like
Oracle). These systems are also running AFS. I'm not sure if many other people
are running that, but it's worth mentioning.

Comment 20 sa@tmt.ca.boeing.com 2007-06-11 15:37:55 UTC
The /etc/gamin/gaminrc workaround previously suggested does not seem to work in
our environment, as we had another runaway gam_server on a system with it in
place.  A strace while the process is consuming 100% CPU produces no output
(whereas when it is running normally it is quite chatty).

Our environment employs roughly 5 static mounts and occasional automounts,
with the largest remote filesystem being only 1 TB in size.  The two local
disks are only 160 GB.

Lots of gam_server segfaults are being recorded in the messages file (see below):

May 23 15:25:42 abc kernel: gam_server[11549]: segfault at 000000000000008f rip
000000000040c61f rsp 0000007fbfffdd20 error 4
May 24 08:23:42 abc kernel: gam_server[24522]: segfault at 0000000000000060 rip
000000000040c61f rsp 0000007fbfffdd30 error 4
May 24 13:23:57 abc kernel: gam_server[25665]: segfault at 000000000000002b rip
0000003cc07234f4 rsp 0000007fbfffddc0 error 4
May 24 17:58:06 abc kernel: gam_server[12894]: segfault at 0000000000001d9b rip
0000003cc0723d56 rsp 0000007fbffff1b0 error 4
May 24 17:59:42 abc kernel: gam_server[15965]: segfault at 000000000000008f rip
000000000040c61f rsp 0000007fbfffda70 error 4
May 24 18:48:42 abc kernel: gam_server[15997]: segfault at 0000000000000049 rip
000000000040c61f rsp 0000007fbfffdd20 error 4
May 24 19:53:42 abc kernel: gam_server[16692]: segfault at 0000000000000083 rip
000000000040c61f rsp 0000007fbfffda70 error 4
May 24 20:35:42 abc kernel: gam_server[17255]: segfault at 000000000000003b rip
000000000040c61f rsp 0000007fbfffdd30 error 4
May 25 01:06:42 abc kernel: gam_server[17614]: segfault at 000000000000008e rip
000000000040c61f rsp 0000007fbfffdd30 error 4
May 25 08:57:10 abc kernel: gam_server[28073]: segfault at 000000000000001a rip
0000003cc07234f4 rsp 0000007fbfffdbb0 error 4
May 25 12:01:42 abc kernel: gam_server[28908]: segfault at 000000000000008e rip
000000000040c61f rsp 0000007fbfffdd20 error 4
May 25 13:10:42 abc kernel: gam_server[2178]: segfault at 000000000000008c rip
000000000040c61f rsp 0000007fbfffdd30 error 4
May 25 14:06:42 abc kernel: gam_server[2777]: segfault at 000000000000003f rip
000000000040c61f rsp 0000007fbfffdd20 error 4
May 25 14:52:42 abc kernel: gam_server[3991]: segfault at 000000000000007e rip
000000000040c61f rsp 0000007fbfffda70 error 4
May 25 15:56:42 abc kernel: gam_server[4513]: segfault at 0000000000000058 rip
000000000040c61f rsp 0000007fbfffdd30 error 4
Jun  4 12:04:47 abc kernel: gam_server[13024]: segfault at 0000000000000059 rip
000000000040c61f rsp 0000007fbfffda60 error 4
Jun  4 15:46:47 abc kernel: gam_server[15938]: segfault at 0000000000000087 rip
000000000040c61f rsp 0000007fbfffdd20 error 4
Jun  4 16:47:47 abc kernel: gam_server[18364]: segfault at 000000000000008b rip
000000000040c61f rsp 0000007fbfffdd20 error 4
Jun  5 13:41:07 abc kernel: gam_server[29176]: segfault at 0000000000000046 rip
000000000040c61f rsp 0000007fbfffda60 error 4
Jun  5 15:13:07 abc kernel: gam_server[751]: segfault at 0000000000000041 rip
000000000040c61f rsp 0000007fbfffdd20 error 4
Jun  5 17:09:07 abc kernel: gam_server[1937]: segfault at 0000000000000050 rip
000000000040c61f rsp 0000007fbfffdd30 error 4
Jun  7 16:29:01 abc kernel: gam_server[27451]: segfault at 0000000000000027 rip
0000003cc07234f4 rsp 0000007fbfffd7b0 error 4


Comment 21 sa@tmt.ca.boeing.com 2007-06-11 17:09:21 UTC
Created attachment 156740 [details]
Snippet of strace output monitoring runaway gam_server

Comment 22 sa@tmt.ca.boeing.com 2007-06-11 17:12:02 UTC
Just had another runaway gam_server on the same system.  This time I was able
to get some strace info (see previous attachment).  As you can see, it seems to
be looping in the user's .kde/share/config subdirectory, rather than checking
other directories as it does when the application is running normally.  The
only files in that subdirectory that changed today are kdeglobals, korgacrc,
kmailrc, uiserverrc, korganizerrc, and emaildefaults.  No segfault this time,
nor a core file.

Comment 23 Gordon Farquharson 2007-06-11 20:03:49 UTC
(In reply to comment #4)

> I do have nfs mounts:
> 
> > df -h
> Filesystem            Size  Used Avail Use% Mounted on
> /dev/mapper/VolGroup00-LogVol00
>                        64G   29G   32G  48% /
> /dev/sda3              99M   26M   68M  28% /boot
> none                  505M     0  505M   0% /dev/shm
> xxx.xxx.xxx.xxx:/export/xxx
>                       826G  276G  542G  34% /h/xxx
> xxx.xxx.xxx.xxx:/export/system/csoft_lnx
>                       230G   46G  184G  20% /net/csoft_lnx
> xxx.xxx.xxx.xxx:/export/system/opt_lnx
>                       230G   46G  184G  20% /net/opt_lnx

The first NFS filesystem (/export/xxx) is ext3, and the second
(/export/system/csoft_lnx) and third (/export/system/opt_lnx) NFS filesystems
are gfs.

Gordon


Comment 24 sa@tmt.ca.boeing.com 2007-06-12 00:04:58 UTC
Created attachment 156769 [details]
Debug output from kill -SIGUSR2 of runaway gam_server

Just had another 'event'; same machine, same user.  Attached is output from a
kill -SIGUSR2 on the runaway gam_server.  Hopefully this will have some useful
information.

Any suggestions/additional workarounds would be appreciated (a fix would be
even better), as it is getting old having to keep killing the gam_server (we
may end up nuking it if this keeps up).

Thanks!

Comment 25 Tom Sightler 2007-06-12 03:47:40 UTC
The posted workaround does not work for me either.  The system of ours that is
experiencing this issue the most has no NFS mounts, but it is an NFS server and
exports a 300GB ext3 filesystem.  Not huge, certainly, but not little either.

I have attached the requested gdb output.

Thanks,
Tom



Comment 26 Tom Sightler 2007-06-12 03:48:43 UTC
Created attachment 156776 [details]
GDB output from stuck gam_server

Comment 27 Tom Sightler 2007-06-12 03:57:37 UTC
> Any suggestions/additional workarounds would be appreciated (a fix would be
> even better), as it is getting old having to keep killing the gam_server (we
> may end up nuking it if this keeps up).

Have you tried downgrading to the RHEL4 U4 version?  We don't seem to have any
problems when we run that version, although maybe it's just good luck.  We've
temporarily blacklisted the update on the systems where we're having the
problems, although I had to put it back on one of the systems to provide
information to Red Hat support.


Comment 28 Alexander Larsson 2007-06-15 08:39:02 UTC
There are two problems here. One is Tom getting a real 100% CPU busy loop,
i.e. running strace on it shows nothing.

The other problem is gamin doing lots of polling.  This has been observed
before, and seems to be related to something KDE does (it only happens with
KDE). I believe this is due to gamin polling too often on NFS, and it can be
fixed by configuring gamin to poll less often. In fact, the reason for this
update was to add a way to set this option globally so that such problems
could be fixed.

Comment 29 Alexander Larsson 2007-06-15 09:00:47 UTC
Looking at the last few backtraces from Tom, it really looks like it's stuck in
the for loop in node_remove_directory_subscription(). However, I still can't
figure out how that could ever happen...

Comment 30 Jan Iven 2007-06-15 09:18:29 UTC
I would agree with #28: there may be two bugs in here, both causing 100% CPU in
gam_server, one while endlessly stat()ing some file (possibly related to NFS),
one in the for() loop. I suggest splitting this.


Comment 31 Alexander Larsson 2007-06-18 08:28:22 UTC
I created bug 244633 for the case where strace shows a lot of polling. This bug
is used for the "strace shows nothing" case.



Comment 32 Alexander Larsson 2007-06-19 07:28:33 UTC
I ran an overnight stress test of gamin, but couldn't reproduce this. It might
need some specific condition to trigger. Any clues gladly accepted.

Comment 33 Alexander Larsson 2007-06-19 10:33:59 UTC
Tom: Any chance you could verify whether the children list in
node_remove_directory_subscription() is circular?

When you break inside node_remove_directory_subscription, do something like
p children
p children->next
p children->next->next
p children->next->next->next

etc., until you find a next pointer that is NULL, or one that has already been
printed.

Comment 34 Alexander Larsson 2007-06-19 11:28:51 UTC
This is a long shot, but could people test this version:

http://people.redhat.com/alexl/RPMS/gamin/gamin-0.1.7-1.2.EL4.0signalsafe.src.rpm

It fixes a use of malloc (which is not signal safe) in a signal handler. It
doesn't seem likely that this would cause circular lists, but it's possible.
Testing will tell.
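
To make the "signal unsafe" point concrete, here is a minimal C sketch of the
general pattern (an illustration only, not gamin's actual patch; the names are
invented). malloc takes internal allocator locks, so calling anything that
allocates from a signal handler can corrupt heap state, exactly the kind of
damage that could later surface as broken lists or segfaults:

  #include <signal.h>
  #include <string.h>
  #include <unistd.h>

  /* UNSAFE: a handler that calls malloc (directly or via any allocating
   * library call) can interrupt malloc in the main program and deadlock
   * or corrupt the heap. */

  /* SAFE: the handler only sets a flag; sig_atomic_t stores are
   * async-signal-safe, and the main loop does the real work later. */
  static volatile sig_atomic_t dump_requested = 0;

  static void sigusr2_handler(int signo)
  {
      (void)signo;
      dump_requested = 1;   /* defer the debug dump to the main loop */
  }

  int main(void)
  {
      struct sigaction sa;
      memset(&sa, 0, sizeof(sa));
      sa.sa_handler = sigusr2_handler;
      sigaction(SIGUSR2, &sa, NULL);

      for (;;) {
          if (dump_requested) {
              dump_requested = 0;
              /* now it is safe to malloc/open/write the debug log */
          }
          pause();   /* stand-in for one g_main_loop_run() iteration */
      }
  }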

Comment 35 Alexander Larsson 2007-06-20 09:24:05 UTC
Sorry, that had a bug, please try instead this updated version:
http://people.redhat.com/alexl/RPMS/gamin/gamin-0.1.7-1.2.EL4.0signalsafe.2.src.rpm


Comment 36 Steve Rigler 2007-06-21 12:16:11 UTC
I've been running both of the updated gamin packages that Alex posted and
haven't seen any problems yet.  With the previous version I would see gam_server
using 100% CPU time after a couple of hours.

Comment 37 Alexander Larsson 2007-06-26 09:08:06 UTC
Has anyone else tested the package i posted?

Comment 38 Tom Sightler 2007-06-26 13:25:52 UTC
I have downloaded both your updated package and a different one provided by
Red Hat support, but so far have not tested either of them.  For whatever
reason, I haven't seen the problem in the last two weeks.  I can't really
explain that, as there have been no changes to the systems in question, and
previously we were seeing the issue almost daily.

Anyway, I'm still willing to try the package you've posted here, but I'm not
sure how valuable a success report would be since we're suddenly not seeing
the problem.

Comment 39 sa@tmt.ca.boeing.com 2007-06-26 16:13:39 UTC
We've been testing it on 3 systems, and I just installed it on a 4th this
morning.  The one system that was experiencing the problem most frequently had
4 failures (core dumps) in one day prior to installing your 2nd release, but
none since then.  So far, none of the systems have shown the problem with your
2nd release installed.

As others have observed, we have some RHEL 4.5 systems that have never
experienced this problem and others that have hit it either frequently or only
once since our upgrades, so it's hard to know what triggers the race
condition.  It is encouraging, however, that with signalsafe.2 we've been
running event-free.

Comment 40 Mark London 2007-07-16 13:56:57 UTC
We have seen this bug here also, as users' home directories are on NFS-served
disks.  Is this patch going to be released soon as an official update, or
should I install the patched version instead?

Comment 41 Sam Sharpe 2007-07-17 10:58:53 UTC
I've just had this reported to me for the first time (and we have 850+ machines
running RHEL4). What's interesting is that this particular machine is an
EL3->EL4 upgrade, which is not a particularly common configuration for us
(mostly we force a reinstall).

Comment 42 Tim Berger 2007-07-19 16:52:14 UTC
I'd just like to chime in here to thank you for the signalsafe.2 update.  It
did the trick for us as well (~90-host deployment).  We were seeing the same
problems: CPU being pinned and regular segfaults.

Comment 43 Thomas Walker 2007-07-20 18:44:58 UTC
I installed the signalsafe.2 update on an EL4 machine which could easily
reproduce the runaway process, and so far it's good.  I'd say get it released
into the RHN channel as a fix right away.  It can't be any worse than the
package that's in there now.

  Thomas Walker

Comment 45 RHEL Program Management 2007-07-24 13:49:30 UTC
This bugzilla has Keywords: Regression.

Since no regressions are allowed between releases, it is also being proposed
as a blocker for this release.

Please resolve ASAP.

Comment 46 Tom Sightler 2007-07-24 14:10:34 UTC
As another data point: back in late June we had another few days where gamin
processes ran away several times a day on a few servers, so we went ahead and
installed the signalsafe.2 packages.  We've been running these packages for
the entire month of July, and we haven't experienced this issue again, so this
patch certainly seems to be an improvement for us.

Thanks,
Tom

Comment 53 Matthias Clasen 2007-07-30 14:08:40 UTC
Built as gamin-0.1.7-1.3.EL4

Comment 55 Issue Tracker 2007-08-02 18:44:24 UTC
Even after upgrading to the signalsafe rpms, gam_server uses 100% CPU intermittently.


This event sent from IssueTracker by rkhadgar 
 issue 128235

Comment 56 Issue Tracker 2007-08-10 18:56:33 UTC
From gamin debug logs

Exclude: use POLL on /media/*
Exclude: use POLL on /mnt/*
Exclude: use POLL on /dev/*
Exclude: use POLL on /proc/*
fstype ext3 monitor kernel poll timeout 0
ext3 filesystem mounted at /featemp
...
Poll: poll_file for /featemp/mcrogers/ACAS/Job-1.cid called
...

From gamin strace 

6812  09:28:24 stat("/featemp/mcrogers/ACAS/Job-1.inp", 0x7fbfffdfc0) =
-1 ENOENT (No such file or directory)
...


It seems gamin was probably stuck in a loop trying to stat a non-existent
file.
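
For illustration, the failure mode described above looks roughly like this in
C (a sketch of the general pattern, not gamin source; poll_once and the
back-off advice are invented for the example):

  #include <stdio.h>
  #include <sys/stat.h>

  /* A poll-mode watch boils down to a periodic stat() of the path.  If
   * the file is gone and the caller retries immediately instead of
   * honoring the poll interval, the loop degenerates into a busy spin of
   * stat() -> ENOENT -> stat() ..., pinning the CPU. */
  static int poll_once(const char *path)
  {
      struct stat st;
      if (stat(path, &st) == -1) {
          perror(path);   /* e.g. "No such file or directory" */
          return -1;      /* caller should wait out the poll interval */
      }
      return 0;
  }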



This event sent from IssueTracker by rkhadgar 
 issue 128235
it_file 98991

Comment 59 Mark London 2007-08-31 16:53:04 UTC
I compiled and installed:

http://people.redhat.com/alexl/RPMS/gamin/gamin-0.1.7-1.2.EL4.0signalsafe.2.src.rpm

However, while this seems to have helped, I still ran into a looping 
gam_server.  The backtrace for it always showed:

(gdb) backtrace
#0  0x00000032146b94a5 in _xstat () from /lib64/tls/libc.so.6
#1  0x0000000000410276 in gam_poll_dnotify_poll_file (node=0x5768d0) at gam_poll_dnotify.c:206
#2  0x00000000004049a5 in gam_poll_file (node=0x5768d0) at gam_server.c:505
#3  0x000000000040759b in gam_poll_generic_scan_directory_internal (dir_node=0x52fc50) at gam_poll_generic.c:349
#4  0x00000000004076e3 in gam_poll_generic_scan_directory (path=0x53db60 "/etc") at gam_poll_generic.c:399
#5  0x000000000040fa7d in gam_dnotify_pipe_handler (user_data=0x51b330) at gam_dnotify.c:315
#6  0x0000003216b266bd in g_main_context_dispatch () from /usr/lib64/libglib-2.0.so.0
#7  0x0000003216b28397 in g_main_context_acquire () from /usr/lib64/libglib-2.0.so.0
#8  0x0000003216b28735 in g_main_loop_run () from /usr/lib64/libglib-2.0.so.0
#9  0x0000000000404c37 in main (argc=1, argv=0x7fbfffb498) at gam_server.c:584


Comment 60 Mark London 2007-08-31 17:04:08 UTC
Sorry, I just figured out the cause of my previous report.  There was a Samba
mount point in a hung state, which was also on the desktop.  If someone tried
to access that mount point, e.g. by running a df command, that would cause
gam_server to start looping.



Comment 61 Gordon Farquharson 2007-09-11 22:34:59 UTC
Hi All

I have been running gamin-0.1.7-1.2.EL4.0signalsafe.2 for over a month, and have
not seen this problem occur again.

Gordon

Comment 66 Jonathan Blandford 2007-09-25 13:57:20 UTC
*** Bug 244694 has been marked as a duplicate of this bug. ***

Comment 68 Stephan Jourdan 2007-11-15 10:26:04 UTC
Hi all,

I've been running this patch on several workstations now, and it looks like
the problem has been solved.

Any idea when the official patch will be released?

Cheers,
Stef

Comment 69 errata-xmlrpc 2007-11-15 16:04:36 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2007-0772.html