Bug 1385040
Summary: gam_server crashing repeatedly
Product: Red Hat Enterprise Linux 6
Component: gamin
Version: 6.8
Hardware: x86_64
OS: Linux
Status: CLOSED WONTFIX
Severity: high
Priority: unspecified
Reporter: Joe Wright <jwright>
Assignee: Ondrej Holy <oholy>
QA Contact: Desktop QE <desktop-qa-list>
CC: aiyengar, alanm, cww, derfian, dkaylor, greg.matthews, jwright, oholy, rick.beldin, sydelko, walters, wbaudler
Target Milestone: rc
Target Release: ---
Type: Bug
Doc Type: If docs needed, set a value
Last Closed: 2017-08-15 20:08:19 UTC
Description (Joe Wright, 2016-10-14 15:39:49 UTC)
It looks like the following Fedora bug; however, all the relevant upstream patches should already be part of this version: https://bugzilla.redhat.com/show_bug.cgi?id=205731

They can also try to add: "fsset nfs4 none"

We already configured /etc/gamin/gaminrc with those parameters prior to filing this bug. No success.

Polling might happen in certain conditions, but the inotify code should not be called if "/home/pier/e/" is an nfs mount and the gaminrc file contains "fsset nfs none" and "fsset nfs4 none", see: https://git.gnome.org/browse/gamin/tree/server/gam_server.c#n185

Maybe "/etc/gamin/gaminrc" is overridden by another gaminrc file, because it has the lowest priority. You can try "/etc/gamin/mandatory_gaminrc" instead, which should have the highest priority, see: https://people.gnome.org/~veillard/gamin/config.html

If it doesn't help, can you please provide the content of all your gaminrc files (i.e. "/etc/gamin/gaminrc", "/etc/gamin/mandatory_gaminrc", "~/.gaminrc") and the "/etc/mtab" file?

Created attachment 1212167 [details]
/etc/gamin/gaminrc file
Created attachment 1212168 [details]
/etc/gamin/mandatory_gaminrc file
Created attachment 1212169 [details]
/etc/mtab
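The configuration change discussed above can be sketched as a small dry-run script. This is only an illustration of the "fsset" directives from the thread; the demo path /tmp/mandatory_gaminrc.demo is an assumption for a safe dry run, while the real target is /etc/gamin/mandatory_gaminrc, the highest-priority gaminrc file.

```shell
# Dry-run sketch: write the "fsset" directives from the thread into a
# gaminrc-style file. The real target on a RHEL 6 host would be
# /etc/gamin/mandatory_gaminrc; we write to a demo path here.
GAMINRC="${GAMINRC:-/tmp/mandatory_gaminrc.demo}"

cat > "$GAMINRC" <<'EOF'
fsset nfs none
fsset nfs4 none
EOF

# gam_server reads its configuration at startup, so on a real host it
# must be restarted afterwards, e.g.: pkill gam_server
cat "$GAMINRC"
```

The "none" backend disables monitoring entirely for the named filesystem types, which is what the thread is trying to verify takes effect.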
Thanks for the data, it looks ok, let's try something else. Can you please provide debugging output for the crashed gam_server? Unfortunately, it is a bit tricky:

1) mv /usr/libexec/gam_server /usr/libexec/gam_server.bak
2) create /usr/libexec/gam_server with the following content:

   #!/bin/sh
   export GAM_DEBUG=1
   exec /usr/libexec/gam_server.bak --notimeout &> /tmp/gamin-debug-$$.log

3) chmod +x /usr/libexec/gam_server
4) pkill gam_server
5) try to reproduce the crash
6) provide /tmp/gamin-debug-<PID OF THE CRASHED GAM_SERVER>.log

Created attachment 1212518 [details]
gamin debug log, PID 5248
Created attachment 1212520 [details]
gamin debug log, PID 5294
Created attachment 1212521 [details]
kernel segfault messages, PID 5248, 5294, 5365, 5384, 6427
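The wrapper trick in the steps above can be sketched as a reusable helper. This is a sketch, not part of the original comment: the function name `wrap_with_debug` is invented here, and the demo runs against a stand-in script in a temp directory instead of the real /usr/libexec/gam_server.

```shell
# wrap_with_debug BINARY LOGDIR: move BINARY aside and install a shim
# that enables GAM_DEBUG and redirects all output to a per-PID log file
# (the "exec" keeps the shim's PID for the real daemon, so the log name
# matches the PID seen in the segfault messages).
wrap_with_debug() {
    bin=$1 logdir=$2
    mv "$bin" "$bin.bak"
    cat > "$bin" <<EOF
#!/bin/sh
export GAM_DEBUG=1
exec "$bin.bak" --notimeout > "$logdir/gamin-debug-\$\$.log" 2>&1
EOF
    chmod +x "$bin"
}

# Demo against a stand-in "daemon" in a temp dir (an assumption for
# this sketch; the real path is /usr/libexec/gam_server):
tmp=$(mktemp -d)
printf '#!/bin/sh\necho "real daemon: $@"\n' > "$tmp/gam_server"
chmod +x "$tmp/gam_server"
wrap_with_debug "$tmp/gam_server" "$tmp"
"$tmp/gam_server"
cat "$tmp"/gamin-debug-*.log
```

On the real host the remaining steps are unchanged: pkill gam_server so the session manager respawns it through the shim, reproduce the crash, then collect the matching log.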
I've attached 2 log files, I have 3 more that are pretty much identical. Let me know if you want those too. --andy.

Thanks for the logs, the additional ones are not needed. It seems that most of the requests are ignored, but sometimes some of them are not:

# This was not ignored:
MONDIR request: from /usr/libexec/gvfsd-trash, seq 1, type 2 options 10
/usr/libexec/gvfsd-trash listening for /package/sage
g_a_s: /package/sage using kernel monitoring
Adding sub /package/sage to listener /usr/libexec/gvfsd-trash

# This was ignored:
MONDIR request: from /usr/libexec/gvfsd-trash, seq 2, type 2 options 10
/usr/libexec/gvfsd-trash listening for /package/sage

# Mount list update happens usually around those messages
Updating list of mounted filesystems

So it seems that there is a race somewhere... does the list of NFS mounts change over time?

Just a note that this is obviously a side effect of Bug 725178, because FAM is currently used for monitoring NFS filesystems, but was not before RHEL 6.8...

Yes, quite often. /home and /package (and more) are automount spaces with lots of NFS mounts possible if someone traverses into them. I had actually asked about that bug change possibility on the initial support case but was shot down.

Hmm, the mtab changes might be the root cause of those crashes. However, I am still not able to reproduce it. Does automount work correctly for you? It seems to me that once gam_server gets a monitoring request, it is not possible to unmount the nfs share because of "device is busy". I did not test it on RHEL 6 before; I will make new tests on RHEL 6...

I'm finally able to reproduce the crashes on RHEL 6.8.
The following seems to be enough to reproduce the crashes:

1/ configure gamin:

   /etc/gamin/mandatory_gaminrc:
   fsset nfs none
   fsset nfs4 none

   pkill gam_server

2/ configure autofs:

   /etc/auto.master:
   /misc /etc/auto.misc --timeout=5

   /etc/auto.misc:
   m0 -fstype=nfs ADDRESS
   m1 -fstype=nfs ADDRESS
   m2 -fstype=nfs ADDRESS
   m3 -fstype=nfs ADDRESS
   m4 -fstype=nfs ADDRESS

   service autofs reload

3/ run the following:

   for i in $(seq 0 4); do ls /misc/m$i; sleep 1; done

Created attachment 1214570 [details]
valgrind
There is also related valgrind output...
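The reproduction recipe above can be staged as a small dry-run script. This is a sketch only: ADDRESS stays a placeholder exactly as in the original comment, and the files are written to a temp directory (CFG) rather than /etc, so nothing on the host changes.

```shell
# Dry-run sketch of the reproducer configs from the comment above.
# CFG defaults to a temp dir so this does not touch /etc on a test run;
# ADDRESS remains a placeholder, as in the original report.
CFG="${CFG:-$(mktemp -d)}"

# 1/ gamin: disable monitoring for nfs/nfs4
printf 'fsset nfs none\nfsset nfs4 none\n' > "$CFG/mandatory_gaminrc"

# 2/ autofs: five short-timeout NFS automounts under /misc
printf '/misc /etc/auto.misc --timeout=5\n' > "$CFG/auto.master"
for i in 0 1 2 3 4; do
    printf 'm%d -fstype=nfs ADDRESS\n' "$i"
done > "$CFG/auto.misc"

# On a real host these would go to /etc/gamin and /etc, followed by
# "service autofs reload" and the trigger loop:
#   for i in $(seq 0 4); do ls /misc/m$i; sleep 1; done
cat "$CFG/auto.misc"
```

The short --timeout=5 is what makes the race reachable: each mount expires a few seconds after the `ls`, so mtab keeps changing while gam_server holds subscriptions.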
Just a note that the steps from Comment 17 work reliably only if the gam_server binary is replaced by the script (Comment 9) and the binary is spawned under valgrind:

   exec valgrind --log-file=/tmp/gamin-valgrind-$$.log --leak-check=full --track-origins=yes /usr/libexec/gam_server.bak --notimeout &> /tmp/gamin-debug-$$.log

More effort (e.g. more mountpoints) is needed in order to reproduce this with an unmodified gam_server.

The crucial problem is that the nfs mountpoints are sometimes handled as a local filesystem (when unmounted) and sometimes as nfs (when mounted). Consequently, polling or none is used for a dir at one time and inotify at another. E.g. a client subscribes to inotify monitoring but never unsubscribes. I am looking for a way to deal with it...

I see the following workarounds for autofs mounts (polling is not possible, because it blocks unmounting):

1) Always use inotify - This should offer more or less the same behavior as before RHEL 6.8 (at least for glib based applications). So you should get notifications about your own changes on nfs, but not about changes made from the network. I think this is the best we can do on nfs if autofs is used:

   fsset nfs kernel

2) Always use none - We can disable monitoring over gamin entirely, but I don't think it is a good idea if home dirs are on nfs. This should not affect the local filesystem, because glib uses its own inotify monitor instead (at least for glib based applications):

   fsset ext4 none
   fsset nfs none

(This example presumes that the local filesystem is ext4.) Let me know if it helps.

I tried

   fsset nfs kernel

and the load shot from 4 to 60+; gam_server for every user was suddenly taking lots of CPU trying to go through top-level directories on every NFS mount.

Using:

   fsset ext4 none
   fsset nfs none

seemed to be better initially. It seems to go through cycles where it will cause the load to swing to 30+ and then come back down again, maybe every 5-10 minutes.
I can't tell if it's time based or based on something like automount mounts coming or going. Neither of these options is a whole lot better than what we've seen so far.

(In reply to Andrew Sydelko from comment #21)
> I tried
>
> fsset nfs kernel
>
> and the load shot from 4 to 60+, gam_server for every user was suddenly
> taking lots of CPU trying to go through top level directories on every NFS
> mount

I suppose that glib file monitoring (which was used before RHEL 6.8) is less demanding; however, the load is distributed across several processes...

> Using:
>
> fsset ext4 none
> fsset nfs none
>
> seemed to be better initially. It seems to go through cycles where it will
> cause the load to swing to 30+ and then come back down again, maybe every
> 5-10 minutes. I can't tell if it's time based or based on something like
> automount mounts coming or going.

I suppose that this is caused by automounts, but you can provide gamin-debug-?.log to see what is happening...

> Neither of these options are a whole lot better than what we've seen so far.

I am looking for a fix for the crashes; however, the default gamin behavior doesn't help you, because polling prevents autofs unmounts and doesn't have a lower load... Maybe GIO_USE_FILE_MONITOR could be backported in order to avoid gamin usage and reduce the load...

I don't suppose I can downgrade glib to the RHEL 6.7 version where these problems didn't exist?

You can probably do it as a temporary workaround, but be careful! The following document should show you an official way: https://access.redhat.com/solutions/29617

I've manually downgraded to the glib2-2.28.8-4.el6 version and it seems to work properly.

Another workaround is to just remove/rename the following library:

   /usr/lib/gio/modules/libgiofam.so

This should remove FAM support from GLib, so it should work the same as before, but with the latest GLib... this is similar to what could be achieved using GIO_USE_FILE_MONITOR if it is backported.
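The library-rename workaround mentioned just above can be sketched as a small reversible helper. This is a sketch, not the thread's exact procedure: the function name `disable_gio_fam` is invented here, and the demo operates on a stand-in file in a temp directory rather than the real /usr/lib/gio/modules/libgiofam.so.

```shell
# disable_gio_fam MODULEDIR: rename libgiofam.so aside so GLib falls
# back to its own inotify monitor instead of FAM/gamin. Reversible by
# renaming back; applications pick the change up on their next start.
disable_gio_fam() {
    moddir=$1
    if [ -e "$moddir/libgiofam.so" ]; then
        mv "$moddir/libgiofam.so" "$moddir/libgiofam.so.disabled"
        echo "disabled: $moddir/libgiofam.so"
    else
        echo "nothing to do in $moddir"
    fi
}

# Demo against a stand-in module directory (the real path given in the
# thread is /usr/lib/gio/modules):
demo=$(mktemp -d)
touch "$demo/libgiofam.so"
disable_gio_fam "$demo"
```

Renaming rather than deleting keeps the rollback trivial, which matters here since the change affects every GLib application on the host.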
I am looking for a way to fix the crashes; however, if you insist on the previous behavior, you should file another bug report for GLib in order to backport the GIO_USE_FILE_MONITOR env variable, or provide another solution... Colin, can you please take a look at this?

Bug 1399726 has been filed in order to revert the previous GLib behavior.

(In reply to Ondrej Holy from comment #20)
> (snip)
>
> 2) Always use none - We can disable monitoring over gamin at all, but I
> don't think it is good idea if home dirs are on nfs. This should not affect
> local filesystem, because glib uses its own inotify monitor instead (at
> least for glib based applications):
>
> fsset ext4 none
> fsset nfs none

It should be enough to use:

   fsset autofs none
   fsset nfs none

See: https://bugzilla.redhat.com/show_bug.cgi?id=1399726#c7

*** Bug 1388909 has been marked as a duplicate of this bug. ***

Let me know if this is still an issue with RHEL 6.9 (or the patch from Bug 1399726).

yes, still an issue with 6.9.

Do you use gamin for something explicitly, or use some special software or environment? Doesn't the gamin configuration from Comment 29 help you? I wonder whether some other project started using gamin in RHEL 6.8, because this wasn't reported before RHEL 6.8...

thanks for the reply. I think I made a mistake in comment 38. The host I saw it on didn't yet have the glib2-2.28.8-9 version. I've just rectified that and am monitoring the logs. however, the configuration in comment 29 does not prevent the gam_server crashes. G

(In reply to Greg Matthews from comment #40)
> thanks for the reply. I think I made a mistake in comment 38. The host I saw
> it on didn't yet have the glib2-2.28.8-9 version. I've just rectified that
> and am monitoring the logs.

Ok, thanks for the comment.

> however, the configuration in comment 29 does not prevent the gam_server
> crashes.

Hmm, can you please provide output from the "mount" command?
I'm starting to have trouble finding hosts that are still displaying this behaviour as we have rolled the recent glib2 packages out. However, those workstations that have not been rebooted since the roll out still show these crashes, and this is the output from mount on one of those:

[qqs43472@ws148 ~]$ mount
/dev/mapper/vg.1-lv_root on / type ext4 (rw)
proc on /proc type proc (rw)
sysfs on /sys type sysfs (rw)
devpts on /dev/pts type devpts (rw,gid=5,mode=620)
tmpfs on /dev/shm type tmpfs (rw,rootcontext="system_u:object_r:tmpfs_t:s0")
/dev/sda1 on /boot type ext4 (rw)
/dev/mapper/vg.1-lv_scratch on /scratch type ext4 (rw)
none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw)
sunrpc on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw)
cs03r-sc-nas-svm02.diamond.ac.uk:/exports/dls_sw/epics on /dls_sw/epics type nfs (rw,rsize=8192,wsize=8192,intr,soft,sloppy,addr=172.23.100.71)
cs04r-nas01-02.diamond.ac.uk:/vol/staff_home/staff-home/pck07289 on /home/pck07289 type nfs (rw,nosuid,rsize=32768,wsize=32768,acl,intr,soft,nfsvers=3,sloppy,addr=172.23.130.7)
gvfs-fuse-daemon on /home/pck07289/.gvfs type fuse.gvfs-fuse-daemon (rw,nosuid,nodev,user=pck07289)
\\\\diamsanserv01.diamond.ac.uk\\pck07289$ on /scratch/pck07289/U type cifs (rw)
dls-sw.diamond.ac.uk:/srv/software/apps/apps on /dls_sw/apps type nfs (rw,rsize=8192,wsize=8192,intr,soft,nfsvers=3,sloppy,addr=172.23.136.33)
cs03r-sc-nas-svm02.diamond.ac.uk:/exports/dls_sw/prod on /dls_sw/prod type nfs (rw,rsize=8192,wsize=8192,intr,soft,sloppy,addr=172.23.100.71)
cs04r-nas01-02.diamond.ac.uk:/vol/technical/technical/sysadmin/linux on /home/sys-admin type nfs (rw,nosuid,nfsvers=3,acl,rsize=32768,wsize=32768,intr,soft,sloppy,addr=172.23.130.7)
\\\\diamsanserv01.diamond.ac.uk\\pck07289$ on /scratch/pck07289/U type cifs (rw)
cs04r-nas01-02.diamond.ac.uk:/vol/staff_home/staff-home/qqs43472 on /home/qqs43472 type nfs (rw,nosuid,rsize=32768,wsize=32768,acl,intr,soft,nfsvers=3,sloppy,addr=172.23.130.7)
cs03r-sc-nas-svm02.diamond.ac.uk:/exports/dls_sw/etc on /dls_sw/etc type nfs (rw,rsize=8192,wsize=8192,intr,soft,sloppy,addr=172.23.100.71)
cs03r-sc-nas-svm02.diamond.ac.uk:/exports/dls/ops_data on /dls/ops-data type nfs (rw,rsize=8192,wsize=8192,intr,soft,nfsvers=3,sloppy,addr=172.23.100.71)
mx-scratch.diamond.ac.uk:/mnt/lustre03/mx-scratch on /dls/mx-scratch type nfs (rw,rsize=32768,wsize=32768,acl,intr,soft,nfsvers=3,sloppy,addr=172.23.142.217)
dls-attic:/srv/attic on /dls/attic type nfs (rw,rsize=32768,wsize=32768,acl,intr,soft,nfsvers=3,sloppy,addr=172.23.150.3)
cs04r-nas01-02.diamond.ac.uk:/vol/staff_home/staff-home/aak24408 on /home/aak24408 type nfs (rw,nosuid,rsize=32768,wsize=32768,acl,intr,soft,nfsvers=3,sloppy,addr=172.23.130.7)
i24-storage.diamond.ac.uk:/mnt/gpfs02/i24 on /dls/i24 type nfs (rw,rsize=32768,wsize=32768,intr,soft,acl,nfsvers=3,sloppy,addr=172.23.154.11)
i04-storage.diamond.ac.uk:/mnt/gpfs02/i04 on /dls/i04 type nfs (rw,rsize=32768,wsize=32768,intr,soft,acl,nfsvers=3,sloppy,addr=172.23.154.11)
i04-1-storage.diamond.ac.uk:/mnt/gpfs02/i04-1 on /dls/i04-1 type nfs (rw,rsize=32768,wsize=32768,intr,soft,acl,nfsvers=3,sloppy,addr=172.23.154.11)
p45-storage.diamond.ac.uk:/mnt/gpfs02/p45 on /dls/p45 type nfs (rw,rsize=32768,wsize=32768,intr,soft,acl,nfsvers=3,sloppy,addr=172.23.154.11)
cs04r-nas01-02.diamond.ac.uk:/vol/science on /dls/science type nfs (rw,nosuid,nfsvers=3,acl,rsize=32768,wsize=32768,intr,soft,sloppy,addr=172.23.130.7)
dls-sw.diamond.ac.uk:/srv/software/apps/dasc on /dls_sw/dasc type nfs (rw,rsize=8192,wsize=8192,intr,soft,nfsvers=3,sloppy,addr=172.23.136.33)
cs03r-sc-nas-svm02.diamond.ac.uk:/exports/dls_sw/work on /dls_sw/work type nfs (rw,rsize=32768,wsize=32768,intr,soft,nfsvers=3,sloppy,addr=172.23.100.71)
cs04r-nas01-02.diamond.ac.uk:/vol/staff_home/staff-home/zva49823 on /home/zva49823 type nfs (rw,nosuid,rsize=32768,wsize=32768,acl,intr,soft,nfsvers=3,sloppy,addr=172.23.130.7)
i02-storage.diamond.ac.uk:/mnt/gpfs02/i02 on /dls/i02 type nfs (rw,rsize=32768,wsize=32768,intr,soft,acl,nfsvers=3,sloppy,addr=172.23.154.11)
i03-storage.diamond.ac.uk:/mnt/gpfs02/i03 on /dls/i03 type nfs (rw,rsize=32768,wsize=32768,intr,soft,acl,nfsvers=3,sloppy,addr=172.23.154.11)
cs04r-sc-vserv-115:/mnt/lustre03/staging on /dls/staging type nfs (rw,rsize=32768,wsize=32768,intr,soft,acl,nfsvers=3,sloppy,addr=172.23.142.30)
cs04r-nas01-02.diamond.ac.uk:/vol/staff_home/staff-home/xfz42935 on /home/xfz42935 type nfs (rw,nosuid,rsize=32768,wsize=32768,acl,intr,soft,nfsvers=3,sloppy,addr=172.23.130.7)
cs04r-nas01-02.diamond.ac.uk:/vol/staff_home/staff-home/ktc05079 on /home/ktc05079 type nfs (rw,nosuid,rsize=32768,wsize=32768,acl,intr,soft,nfsvers=3,sloppy,addr=172.23.130.7)
m02-storage.diamond.ac.uk:/mnt/gpfs02/m02 on /dls/m02 type nfs (rw,rsize=32768,wsize=32768,intr,soft,acl,nfsvers=3,sloppy,addr=172.23.180.70)
i15-storage.diamond.ac.uk:/mnt/gpfs02/i15 on /dls/i15 type nfs (rw,rsize=32768,wsize=32768,intr,soft,acl,nfsvers=3,sloppy,addr=172.23.154.11)
m05-storage.diamond.ac.uk:/mnt/gpfs02/m05 on /dls/m05 type nfs (rw,rsize=32768,wsize=32768,intr,soft,acl,nfsvers=3,sloppy,addr=172.23.185.70)
cs04r-nas01-02.diamond.ac.uk:/vol/staff_home/staff-home/fer45166 on /home/fer45166 type nfs (rw,nosuid,rsize=32768,wsize=32768,acl,intr,soft,nfsvers=3,sloppy,addr=172.23.130.7)
cs04r-nas01-02.diamond.ac.uk:/vol/staff_home/staff-home/kdf51254 on /home/kdf51254 type nfs (rw,nosuid,rsize=32768,wsize=32768,acl,intr,soft,nfsvers=3,sloppy,addr=172.23.130.7)
cs04r-nas01-02.diamond.ac.uk:/vol/dls_tmp/dls_tmp on /dls/tmp type nfs (rw,nosuid,nfsvers=3,acl,rsize=32768,wsize=32768,intr,soft,sloppy,addr=172.23.130.7)
dls-bl-storage.diamond.ac.uk:/srv/bl-data/i06 on /dls/i06 type nfs (rw,rsize=32768,wsize=32768,acl,intr,soft,nfsvers=3,sloppy,addr=172.23.142.13)
[qqs43472@ws148 ~]$ cat /etc/gamin/gaminrc
fsset nfs none
fsset autofs none

Thanks! Ah, there isn't any mount of type autofs. So, the suggested workaround can't work for you.
But you use autofs, don't you? It seems that the autofs mount is present in some configurations and not in others. It should work in your case if you add "fsset ext4 none" to your gaminrc file...

yes, all of those nfs mounts are there from autofs. We have no nfs mounts in /etc/fstab for workstations. if I set "fsset ext4 none" then presumably gamin is basically disabled completely right?

If you set "fsset ext4 none" and "fsset nfs none", then we can say that monitoring in gamin is completely disabled in your case (given the output from the mount command). So, for example, monitoring over GLib would still work on ext4, but not for nfs...