Bug 510530 (R5.4)
Summary: autofs-5.0.1-0.rc2.129 (RHEL 5.4 beta automounter) has memory leak
Product: Red Hat Enterprise Linux 5
Component: autofs
Status: CLOSED ERRATA
Severity: high
Priority: high
Version: 5.3
Target Milestone: rc
Hardware: x86_64
OS: Linux
Reporter: bg <bgbugzilla>
Assignee: Ian Kent <ikent>
QA Contact: BaseOS QE <qe-baseos-auto>
CC: cward, dkovalsk, ikent, jmoyer, kvolny, rlerch, sghosh, sprabhu, tao, ykopkova, zbrown
Doc Type: Bug Fix
Doc Text: Previously, the method used by autofs to clean up pthreads was not reliable and could result in a memory leak. If the memory leak occurred, autofs would gradually consume all available memory and then crash. A small semantic change in the code prevents this memory leak from occurring now.
Last Closed: 2009-09-02 11:58:53 UTC
Description
bg
2009-07-09 16:45:26 UTC
(In reply to comment #0)
> Expected results:
> automounter works
>
> Additional info:
> We have a very large direct map (8500+ entries) and a large indirect user
> map (33,000+). I was able to reproduce the issue with both files and LDAP
> as a backend for the auto.master + associated files.

Are your maps simple direct and indirect maps? I've had a report of this
upstream but haven't been able to get anywhere with it yet. If you can tell
me more about the structure of your maps I'll have another try at reproducing
the problem.

Here's our auto.master and associated files:

    $ cat /etc/auto.master
    # auto.master for autofs5 machines
    /usr2 auto.home --timeout 60
    /-    auto.direct --timeout 60
    /net  -hosts

    $ cat /etc/auto.direct
    +/etc/auto.projects
    /opt/random/src -rw,noquota filerx:/vol/vol0/src
    /opt/random/doc -rw,noquota filerx:/vol/vol0/doc

The direct maps are there because we have multiple layers of nested mounts,
so auto.projects might look like the following:

    /mnt/dir1                        -rw,noquota,intr filer1:/vol/subdir/dir1
    /mnt/dir1/dir2                   -rw,noquota,intr filer2:/vol/subdir1/dir2
    /mnt/dir1/dir2/dir3              -rw,noquota,intr filer3:/vol/subdir1/dir2_dir3
    /mnt/dir1/dir2/dir3/dir4         -rw,noquota,intr filer4:/vol/subdir1/dir2_dir3_dir4
    /mnt/dir1/dir2/dir3/dir4/include -rw,noquota,intr filer5:/vol/subdir1/dir2_dir3_dir4_include

Home directories as indirect maps (auto.home) might look like:

    user1     -rw,noquota,intr fileru:/vol/vol2/usr2/user1
    user2     -rw,noquota,intr fileru:/vol/vol0/usr2/user2
    user3     -rw,noquota,intr fileru:/vol/vol2/usr2/user3
    .
    .
    .
    user33000 -rw,noquota,intr fileru:/vol/vol2/usr2/user33000

Created attachment 351216 [details]
Patch that attempts to fix suspected semantic problem with pthread_cleanup_push()
I'm not sure this fixes the memory leak issue, but testing looked promising.
I'll get a scratch build done with this patch included.
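
For readers following along, here is a minimal illustration of the general class of pthread_cleanup_push() ordering pitfall under discussion. This is a hypothetical sketch, not the attached patch; the struct, function, and variable names are all invented:

```c
/*
 * Hypothetical sketch, not the attached autofs patch. If a thread can be
 * cancelled after allocating state but before its cleanup handler is
 * registered, the allocation is orphaned and leaks. Disabling
 * cancellation around the allocate-then-push window closes that gap.
 */
#include <pthread.h>
#include <stdlib.h>

struct state {
	char buf[4096];
};

static void free_state(void *arg)
{
	free(arg);
}

static void *worker(void *arg)
{
	struct state *st;
	int save;

	/* Hold cancellation off until the cleanup handler is in place. */
	pthread_setcancelstate(PTHREAD_CANCEL_DISABLE, &save);
	st = malloc(sizeof(*st));
	if (!st) {
		pthread_setcancelstate(save, NULL);
		return NULL;
	}
	pthread_cleanup_push(free_state, st);
	pthread_setcancelstate(save, NULL);

	/* ... work containing cancellation points is now covered ... */

	/* Pop with execute=1 so the normal exit path frees the state too. */
	pthread_cleanup_pop(1);
	return NULL;
}

int main(void)
{
	pthread_t t;

	if (pthread_create(&t, NULL, worker, NULL) == 0)
		pthread_join(t, NULL);
	return 0;
}
```

With this ordering the state is freed exactly once whether the thread is cancelled or finishes normally; pushing the handler after a cancellation point, or returning without popping, would reopen the leak.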
Please try the scratch build at:

http://people.redhat.com/~ikent/autofs-5.0.1-0.rc2.129.bz510530.1

On my LDAP-sourced machine it looks really great right now. CPU cycles seem
under control and the RAM is looking good after 10 minutes.

    $ ps aux | grep auto
    USER     PID   %CPU %MEM VSZ    RSS   TTY STAT START TIME COMMAND
    root     30586 0.4  1.2  120216 11388 ?   Ssl  07:46 0:02 automount

Something I hadn't mentioned before is the extreme amount of CPU automount
was taking up on the files-based map. This is still happening with this
release. (I can open a separate BZ if you'd like.)

    $ date
    Fri Jul 10 07:57:48 PDT 2009
    $ ps aux | grep auto
    USER     PID   %CPU %MEM VSZ    RSS   TTY STAT START TIME COMMAND
    root     2114  98.6 2.2  170308 20424 ?   Ssl  07:41 6:23 automount
    $ date
    Fri Jul 10 07:57:51 PDT 2009

In just ~16 minutes of running it has already used over 6 minutes of CPU,
and this is with the new patched version. The memory has stayed put at 2.3%
or below, however. That's looking MUCH better so far. I'll let this run
today and continue to monitor it.

    top - 08:00:14 up 1 day, 5:25, 1 user, load average: 0.92, 0.89, 0.68
    Tasks: 103 total, 3 running, 100 sleeping, 0 stopped, 0 zombie
    Cpu(s): 30.0%us, 70.0%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
    Mem:   895464k total, 842456k used,  53008k free,  59220k buffers
    Swap:  524280k total,  15024k used, 509256k free, 566516k cached

    PID  USER PR NI VIRT RES SHR  S %CPU %MEM TIME+   COMMAND
    2114 root 15 0  166m 19m 1328 S 99.2 2.3  7:28.69 automount

The way the CPU spikes, it will shoot up for a few seconds, rest for between
2 and 5 seconds, then shoot up to 99% or so for a few seconds and rest again.
Rinse, repeat.

Thanks for the super quick responses, Ian.

(In reply to comment #6)
> On my LDAP-sourced machine it looks really great right now. CPU cycles seem
> under control and the RAM is looking good after 10 minutes.

OK, that's good. I'll go over the code, as there were a couple of other
places where this might be happening. I'll change them to use a slightly
different sequence of events to eliminate the potential for the problem,
just in case. I'll sort that out by Monday, in time for the exception
deadline.

> Something I hadn't mentioned before is the extreme amount of CPU automount
> was taking up on the files-based map. This is still happening with this
> release. (I can open a separate BZ if you'd like.)

We'll have to open a separate bug for that, because the change to fix this
issue really needs to get into the 5.4 release. The scanning of file maps
should be significantly less, but clearly I've got that wrong somehow. It
will spike when the map is read, but it shouldn't consult the file map again
until the map file is modified. Or maybe it isn't actually the reading of
the file maps that is causing the spike?

With your indirect map, do you use the browse option (or --ghost), or have
you either commented out BROWSE_MODE="no" or used BROWSE_MODE="yes"? That
will result in significant CPU usage for a map of this size.
This is a known problem and has been with us for a long time (it's about the
last really big problem), and although I have thought about it many times I
still don't have a way to resolve it. But, as it is so difficult, I have
left it till last, so it will be getting some close attention soon.

> The memory has stayed put at 2.3% or below, however. That's looking MUCH
> better so far. I'll let this run today and continue to monitor it.
>
> The way the CPU spikes, it will shoot up for a few seconds, rest for
> between 2 and 5 seconds, then shoot up to 99% or so for a few seconds and
> rest again. Rinse, repeat.

The question then is: does this correspond to expire events or to mount
events? Expire events would be every timeout/4 seconds.

> Thanks for the super quick responses, Ian.

My pleasure.

Ian

One other thing: the kernel, is it the RHEL 5.4 kernel?

In this case the kernel is the RHEL 5.3 kernel. It will take me a few days
to get the 5.4 kernel spun up, but I can do that if you'd like me to test it
there as well. If I can find time this weekend I will attempt it.

Everything is still working great, and the memory utilization actually seems
to have dropped a bit.

Nice work!

I'll open a new BZ about the processor utilization as soon as I can. Thanks
again.

    Linux myhost 2.6.18-128.el5 #1 SMP Wed Dec 17 11:41:38 EST 2008
    x86_64 x86_64 x86_64 GNU/Linux

Also tested on:

    Linux myotherhost #1 SMP Wed Dec 17 11:42:39 EST 2008 i686 i686 i386
    GNU/Linux

(In reply to comment #10)
> In this case the kernel is the RHEL 5.3 kernel. It will take me a few days
> to get the 5.4 kernel spun up, but I can do that if you'd like me to test
> it there as well. If I can find time this weekend I will attempt it.

OK. The reason I asked is that we can't realize all the CPU improvements
without the 5.4 kernel. The 5.4 kernel includes the new autofs control ioctl
interface, and while the primary reason for this implementation wasn't to
reduce CPU utilisation, a feature was added that can help quite a bit.

The source of the improvement is the is_mounted() function, which checks
whether a path is mounted and whether it is an autofs or other file system.
We use is_mounted() a lot, and it scans either /etc/mtab or /proc/mounts, as
required; but when the new ioctl interface is in use we can ask the kernel
for this directly, avoiding the scan altogether. With a large number of
direct mounts this can give a significant improvement, and even without a
large direct map the improvement is quite noticeable.

Another thing: if you're using a 5.3 base system and just upgrade the kernel
and autofs, then you need to tell autofs you want to use the new interface
by adding a configuration option, as can be seen in the configuration of a
fresh install:

    #
    # If the kernel supports using the autofs miscellanous device
    # and you wish to use it you must set this configuration option
    # to "yes" otherwise it will not be used.
    USE_MISC_DEVICE="yes"
    #

> Everything is still working great, and the memory utilization actually
> seems to have dropped a bit.

Great.
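
To picture what the scan-based check costs, here is a hypothetical sketch; it is not autofs's actual is_mounted() implementation (only the function's purpose and the /etc/mtab and /proc/mounts scanning come from the comment above):

```c
/*
 * Hypothetical sketch of a scan-based mount check, not autofs's actual
 * is_mounted(). Every call re-reads and parses the whole mount table,
 * so the cost grows with the number of mounts; the 5.4 kernel's
 * miscellaneous-device ioctl interface lets the daemon ask the kernel
 * directly and skip this scan entirely.
 */
#include <mntent.h>
#include <stdio.h>
#include <string.h>

static int is_mounted_scan(const char *path)
{
	FILE *tab = setmntent("/proc/mounts", "r");
	struct mntent *mnt;
	int found = 0;

	if (!tab)
		return -1;
	while ((mnt = getmntent(tab))) {
		if (strcmp(mnt->mnt_dir, path) == 0) {
			found = 1;
			break;
		}
	}
	endmntent(tab);
	return found;
}

int main(int argc, char **argv)
{
	if (argc > 1)
		printf("%s mounted: %d\n", argv[1], is_mounted_scan(argv[1]));
	return 0;
}
```

With thousands of direct mount triggers, each needing checks like this, a linear scan per call adds up quickly, which is why asking the kernel directly is such a win.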
I've done a little more testing and have found no evidence that the other
cases I mentioned are affected by this call-order mistake, so the patch here
may be all we need.

> Nice work!

Thanks, but I had already been working on this, and your report plus the
evidence I had collected from the upstream reporter caused the penny to drop
as to the cause. So thanks for reporting it.

> I'll open a new BZ about the processor utilization as soon as I can.

OK, but perhaps we should check with the 5.4 kernel before going ahead with
that.

> Linux myhost 2.6.18-128.el5 #1 SMP Wed Dec 17 11:41:38 EST 2008
> x86_64 x86_64 x86_64 GNU/Linux

Right, the new ioctl interface went into rev 137, but a couple of other bug
fixes have gone in since then as well.

> Also tested on:
>
> Linux myotherhost #1 SMP Wed Dec 17 11:42:39 EST 2008 i686 i686 i386
> GNU/Linux

Mmmm ... no kernel version. ;)

Ian

I've gone through all the code and inspected the locations where this might
potentially be a problem (twice) and have not found any other places where
this issue is present. So the patch we have should be all that is needed.
Could I have the needed acks to commit this to CVS, please?

Ian

I've tested this in RHEL 5.3 on both i686 and x86_64, using LDAP as a back
end for the automount maps, with perfect results. Memory usage has stayed
low since it was installed and there have been no abnormal terminations.
Startup and shutdown remain slow (due to the large number of map entries),
but the results are no different from what we experienced before.

I approve!

I also managed to get a full beta 5.4 installation going with our image. You
were right: those kernel mods made a huuuuge difference. Now, instead of
60-90% system utilization due to the automount process, it's down to less
than 10%. That's still too high, but you're on the right track, it seems.

(This is my 5.4 box.)

    $ date
    Sat Jul 11 23:54:46 PDT 2009
    $ ps aux | grep automount
    USER PID   %CPU %MEM VSZ    RSS   TTY STAT START TIME COMMAND
    root 12157 14.2 2.1  104740 19456 ?   Ssl  23:41 1:57 automount
    $ date
    Sat Jul 11 23:54:48 PDT 2009
    $ uname -a
    Linux yetanothermyhost 2.6.18-155.el5 #1 SMP Fri Jun 19 17:06:31 EDT 2009
    x86_64 x86_64 x86_64 GNU/Linux

(With the kernel version this time!)

So with approximately 14 minutes of run time it has used up 1:57 of CPU
time. This is certainly a HUGE improvement over a kernel without the
enhancements, but it's still not quite good enough for us to switch back to
files yet; LDAP will have to do for now. However, memory usage is great, and
that's what this ticket was meant to solve.

Really, though, thank you VERY much for your work on this ticket. It's going
to make a big difference for our Red Hat implementation. Please deploy your
updates ASAP. If other changes were rolled in with this one in 129, send me
the updated release and I will install and test it with the same expediency.

(In reply to comment #13)
> I've tested this in RHEL 5.3 on both i686 and x86_64, using LDAP as a back
> end for the automount maps, with perfect results. Memory usage has stayed
> low since it was installed and there have been no abnormal terminations.
> Startup and shutdown remain slow (due to the large number of map entries),
> but the results are no different from what we experienced before.

Yes, there really isn't anything we can do about startup time. The simple
fact is that we need to read the entire direct map in when we start.
Indirect maps that do not use the browse option don't need to be read at
start, and they aren't in version 5, so the slowness must be due to the
direct map.
> I approve!
>
> I also managed to get a full beta 5.4 installation going with our image.
> You were right: those kernel mods made a huuuuge difference. Now, instead
> of 60-90% system utilization due to the automount process, it's down to
> less than 10%. That's still too high, but you're on the right track, it
> seems.

That's a little disappointing, although not entirely unexpected. Logging a
bug to investigate this would be useful, as I need to identify exactly where
the remaining bottlenecks are, to make sure my suspicions are correct.

> So with approximately 14 minutes of run time it has used up 1:57 of CPU
> time. This is certainly a HUGE improvement over a kernel without the
> enhancements, but it's still not quite good enough for us to switch back
> to files yet; LDAP will have to do for now. However, memory usage is
> great, and that's what this ticket was meant to solve.

Yep.

> Really, though, thank you VERY much for your work on this ticket. It's
> going to make a big difference for our Red Hat implementation. Please
> deploy your updates ASAP.

Will do. Getting the kernel update to a point suitable for upstream
acceptance took a lot longer than I had hoped, but that is behind us now and
I can focus on further resource improvements. Having the current
improvements in place will allow us to identify exactly where the remaining
resource-intensive code is (I suspect a couple of places). Further
improvements will get harder from here, but that's what development is
about. And we need to be sure that we have targeted the right places.

Ian

(In reply to comment #14)
> Yes, there really isn't anything we can do about startup time. The simple
> fact is that we need to read the entire direct map in when we start.
> Indirect maps that do not use the browse option don't need to be read at
> start, and they aren't in version 5, so the slowness must be due to the
> direct map.

Sorry, what I've said here isn't correct any more. We'll pick this up in bug
510941, but we need to correct this statement here to avoid confusion if we
refer back to this bug later.

With the latest changes, file maps should always be read at startup. Since
we have to read the map at some point, this was a trade-off between spending
the time at startup or spending it upon the first lookup. So, if anything,
startup should be even slower than previously.

Ian

The correction identified in this bug is available in package
autofs-5.0.1-0.rc2.130.

(In reply to comment #17)
> The correction identified in this bug is available in package
> autofs-5.0.1-0.rc2.130.
This package is also available for further testing at:

http://people.redhat.com/~ikent/autofs-5.0.1-0.rc2.130

Hmmm. I'm still using the original BZ release and have not tried 130 yet,
but autofs has been running for some time now and it doesn't seem to be
responding to directory change requests.

I've let a cd run for about 20 minutes now and haven't seen it actually
work. Should I open a different BZ for this? I'm not sure why it's hanging
like this. I've tried a few directories and haven't gotten into any
automounted NFS directory yet.

    $ time cd /mnt/nfs
    bash: cd: /mnt/nfs: Interrupted system call

    real    30m53.465s
    user    0m0.000s
    sys     0m0.000s
    $

(In reply to comment #22)
> Hmmm. I'm still using the original BZ release and have not tried 130 yet,
> but autofs has been running for some time now and it doesn't seem to be
> responding to directory change requests.
>
> I've let a cd run for about 20 minutes now and haven't seen it actually
> work. Should I open a different BZ for this? I'm not sure why it's hanging
> like this. I've tried a few directories and haven't gotten into any
> automounted NFS directory yet.

That's a big surprise given the testing I've done.

Please open a new bug and post a sysrq-t dump if possible. If you can
duplicate it with debug logging enabled, that log would also be useful.

Ian

(In reply to comment #24)
> That's a big surprise given the testing I've done.
>
> Please open a new bug and post a sysrq-t dump if possible. If you can
> duplicate it with debug logging enabled, that log would also be useful.

Also, are there any messages in the log?

Ian

(In reply to comment #2)
> $ cat /etc/auto.direct
> +/etc/auto.projects
> /opt/random/src -rw,noquota filerx:/vol/vol0/src
> /opt/random/doc -rw,noquota filerx:/vol/vol0/doc
>
> The direct maps are there because we have multiple layers of nested
> mounts, so auto.projects might look like the following:
>
> /mnt/dir1                        -rw,noquota,intr filer1:/vol/subdir/dir1
> /mnt/dir1/dir2                   -rw,noquota,intr filer2:/vol/subdir1/dir2
> /mnt/dir1/dir2/dir3              -rw,noquota,intr filer3:/vol/subdir1/dir2_dir3
> /mnt/dir1/dir2/dir3/dir4         -rw,noquota,intr filer4:/vol/subdir1/dir2_dir3_dir4
> /mnt/dir1/dir2/dir3/dir4/include -rw,noquota,intr filer5:/vol/subdir1/dir2_dir3_dir4_include

Sorry, I didn't notice these nested direct mounts before. Are you sure this
ever worked with version 5?

Although I don't explicitly check for nesting in direct mount map entries,
they can't work and basically aren't supported. If they have worked
previously, they were a problem waiting to happen. To do this you need to
use submounts with either a direct mount or an indirect mount at the base
of each tree of offset mounts.
Once again, sorry I missed this before, and sorry if the change to using
strict direct mount semantics in version 5 is causing inconvenience; but, as
far as I know, this is the way it is with other industry-standard automount
implementations.

Ian

(In reply to comment #26)
> Although I don't explicitly check for nesting in direct mount map entries,
> they can't work and basically aren't supported. If they have worked
> previously, they were a problem waiting to happen. To do this you need to
> use submounts with either a direct mount or an indirect mount at the base
> of each tree of offset mounts.

Actually, that's not correct. Multi-mount map entries are the way nested
mount trees must be done, even if submount maps are used to organize groups
of entries. For example, the direct mounts above would need to be converted
to a direct mount at the base of the tree, with offsets from the first
nesting point in the tree, and would look something like:

    /mnt/dir1 \
        /                       -rw,noquota,intr filer1:/vol/subdir/dir1 \
        /dir2                   -rw,noquota,intr filer2:/vol/subdir1/dir2 \
        /dir2/dir3              -rw,noquota,intr filer3:/vol/subdir1/dir2_dir3 \
        /dir2/dir3/dir4         -rw,noquota,intr filer4:/vol/subdir1/dir2_dir3_dir4 \
        /dir2/dir3/dir4/include -rw,noquota,intr filer5:/vol/subdir1/dir2_dir3_dir4_include

Clearly, the path /mnt/dir1 could be a direct or indirect mount entry,
although I believe other implementations don't allow this for direct mount
entries. The semantic behaviour of this type of map entry is specifically
designed to handle nested trees of mounts, and it is the only way that
nesting of mounts can be done.

In version 4, multi-mount entries were a problem because every mount in the
tree of offsets had to be mounted (and expired) as a single unit upon
accessing the directory at the top of the tree. In version 5, entries in the
nested tree are mounted and expired as you go to avoid this problem, but the
limitation below still exists.

The limitation you need to be aware of with these entries is that changes to
a multi-mount map entry cannot be seen until the entire tree has expired
away and a mount is triggered again. This is specifically because of
possible dependencies due to the nesting.

Ian

This is not in direct response to your last post -- I can elaborate more on
that in a few days.

However, to address the problem I'm having with the BZ build (and 130): the
automounter WILL cease functioning after about 24-72 hours. The daemon is
still running, but it's non-responsive.

After talking with Mike and Deke, they suggested I add this to
/etc/init.d/autofs:

    ulimit -n 20480
    ulimit -s 65535

This seems to have fixed the final issue I was having. I've had several .130
servers running on both RHEL 5.4 and 5.3, 32-bit and 64-bit, for 72 hours
now, and I've not had any crashes since. This is a positive development.
I'll check back on this ticket in a few days to let you know if it is still
stable.
I will open a new BZ (and an RH ticket) at that time (or perhaps sooner),
however, because I don't want to hand-edit /etc/init.d/autofs on all of my
hosts. It would be nice to have that controlled through
/etc/sysconfig/autofs, or even to bump the default value, if that doesn't
create a very negative effect on the system.

I hope this helps you some. If you want to dive deeper into my specific
issue, I can easily reproduce the unresponsiveness without those settings.

This last week+ has been excellent for our autofs environment. Thanks again.

(In reply to comment #28)
> However, to address the problem I'm having with the BZ build (and 130):
> the automounter WILL cease functioning after about 24-72 hours. The daemon
> is still running, but it's non-responsive.
>
> After talking with Mike and Deke, they suggested I add this to
> /etc/init.d/autofs:
>
> ulimit -n 20480
> ulimit -s 65535

Mmmm, interesting.

> This seems to have fixed the final issue I was having. I've had several
> .130 servers running on both RHEL 5.4 and 5.3, 32-bit and 64-bit, for 72
> hours now, and I've not had any crashes since. This is a positive
> development. I'll check back on this ticket in a few days to let you know
> if it is still stable.

I'm not sure that setting the open file limit higher will have any effect,
as in the daemon we do:

    ...
    #define MAX_OPEN_FILES 10240
    ...
    rlim.rlim_cur = MAX_OPEN_FILES;
    rlim.rlim_max = MAX_OPEN_FILES;
    res = setrlimit(RLIMIT_NOFILE, &rlim);
    if (res)
            warn(LOGOPT_NONE,
                 "can't increase open file limit - continuing");
    ...

But changing the maximum open file limit in the daemon would be no big deal,
and given the number of mounts I expect to be able to deal with, it is
probably a good idea.

As far as the maximum stack size goes, that's a bit more of a question. I
explicitly set a stack size for worker threads; the call to create the stack
attributes succeeds, and subsequent calls to create the threads work as
well. I use the default stack size for the mount handling thread, but its
main job is to create worker threads. However, I don't call setrlimit(2) to
increase the process's maximum stack size, so maybe that is the source of
the difficulty. Again, it's no big deal to add that, and it seems like
something I should have done at the outset.

> I will open a new BZ (and an RH ticket) at that time (or perhaps sooner),
> however, because I don't want to hand-edit /etc/init.d/autofs on all of my
> hosts. It would be nice to have that controlled through
> /etc/sysconfig/autofs, or even to bump the default value, if that doesn't
> create a very negative effect on the system.

I'd rather handle as much as possible in the daemon itself, so that it is
automount-specific, in a single place, and automatic. In time (but not any
time soon) I'd like to eliminate the need to hold open file handles,
particularly for direct and offset mounts, and only open them when needed
for specific operations. This has only now become a possibility with the new
ioctl implementation, so I'm not in a rush to do it; too many changes at
once is a recipe for disaster, and we already have a lot of changes.

> I hope this helps you some. If you want to dive deeper into my specific
> issue, I can easily reproduce the unresponsiveness without those settings.

I'm not sure we need to go deeper into this, other than to work out whether
it is in fact the lack of a setrlimit(2) call to increase the allowable
stack size.
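
A minimal sketch of the setrlimit(2) approach being discussed follows. This is hypothetical code, not the autofs change; the 64 MiB figure merely echoes the "ulimit -s 65535" (KiB) workaround above:

```c
/*
 * Hypothetical sketch: raising the process stack limit from inside the
 * daemon with setrlimit(2) instead of relying on "ulimit -s" in the
 * init script. Not the actual autofs code; the size is illustrative.
 */
#include <stdio.h>
#include <sys/resource.h>

static int raise_stack_limit(rlim_t bytes)
{
	struct rlimit rlim;

	if (getrlimit(RLIMIT_STACK, &rlim))
		return -1;
	if (rlim.rlim_cur != RLIM_INFINITY && rlim.rlim_cur < bytes) {
		rlim.rlim_cur = bytes;
		/* An unprivileged process cannot exceed its hard limit. */
		if (rlim.rlim_max != RLIM_INFINITY && rlim.rlim_max < bytes)
			rlim.rlim_cur = rlim.rlim_max;
		return setrlimit(RLIMIT_STACK, &rlim);
	}
	return 0;
}

int main(void)
{
	/* 64 MiB, roughly matching the "ulimit -s 65535" workaround. */
	if (raise_stack_limit(64 * 1024 * 1024))
		perror("setrlimit(RLIMIT_STACK)");
	return 0;
}
```

Doing this in the daemon keeps the setting automount-specific and automatic, which matches the preference stated above for handling limits in the daemon rather than in the init script.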
We probably need to discuss the nesting of direct mounts further in order
for you to fully understand the reasons for my comments and the reasons
things are done the way they are. That's bound to be difficult in itself.

> This last week+ has been excellent for our autofs environment. Thanks
> again.

That's good to hear; at least we're getting there. As I said before, some of
the recent changes were a long time coming, and a bunch of initiatives have
landed, coincidentally, all at once in the 5.4 release.

Ian

(In reply to comment #29)
> As far as the maximum stack size goes, that's a bit more of a question. I
> explicitly set a stack size for worker threads; the call to create the
> stack attributes succeeds, and subsequent calls to create the threads work
> as well. I use the default stack size for the mount handling thread, but
> its main job is to create worker threads. However, I don't call
> setrlimit(2) to increase the process's maximum stack size, so maybe that
> is the source of the difficulty. Again, it's no big deal to add that, and
> it seems like something I should have done at the outset.

Mmmm, setting the stack size in the pthread thread creation attributes only
sets the minimum stack size allowed, so it probably doesn't actually do
much. So using setrlimit(2) looks like the way to go.

Ian

Release note added. If any revisions are required, please set the
"requires_release_notes" flag to "?" and edit the "Release Notes" field
accordingly. All revisions will be proofread by the Engineering Content
Services team.

New Contents:

Previously, the method used by autofs to clean up pthreads was not reliable
and could result in a memory leak. If the memory leak occurred, autofs would
gradually consume all available memory and then crash. A small semantic
change in the code prevents this memory leak from occurring now.

An advisory has been issued which should help the problem described in this
bug report. This report is therefore being closed with a resolution of
ERRATA. For more information on the solution and/or where to find the
updated files, please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2009-1397.html