Red Hat Bugzilla – Bug 1489542
Behavior change in autofs expiry timer when a path walk is done following commit from BZ 1413523
Last modified: 2018-04-10 18:02:19 EDT
Description of problem:

Autofs will unmount within the specified timeout regardless of queries/stats of the mount point. Prior to RHEL 7.4, traversing a path would renew the timeout value.

Additionally, the hope for this BZ is to gain additional details on the new behavior and to address some of the concerns raised in case 01909702.

Version-Release number of selected component (if applicable):

Kernel: 3.10.0-693.1.1.el7.x86_64
nfs-utils-1.3.0-0.48.el7.x86_64       Tue Aug 15 11:59:31 2017
rpcbind-0.2.0-42.el7.x86_64           Tue Aug 15 11:55:40 2017
nfs4-acl-tools-0.3.3-15.el7.x86_64    Tue Aug 15 12:03:08 2017
autofs-5.0.7-69.el7.x86_64            Tue Aug 15 12:01:39 2017
libsss_autofs-1.15.2-50.el7.x86_64    Tue Aug 15 11:58:45 2017

How reproducible:

Always, with 'ls' and 'df' commands; additionally occurs with the 'noac' mount option.

Steps to Reproduce:
[1] Fresh install of RHEL 7.4
[2] Install generic versions of the autofs, nfs and rpcbind packages
[3] Create an autofs mount (ideally with a low timeout for testing), e.g.:
    # grep nfs /etc/auto.master
    /- /etc/auto.nfs --timeout=60
    # grep nfs /etc/auto.nfs
    /nfs -noac 10.13.153.156:/mnt/export1/
[4] Turn automount debugging all the way up:
    # grep OPTION /etc/sysconfig/autofs
    OPTIONS="--debug"
[5] Start the autofs service
[6] cd into the automount
[7] Exit the automount
[8] Start some query of the mount point, e.g.:
    # while true; do df /nfs/ >/dev/null; sleep 10; done
[9] Allow some time (2-3x the timeout period) to pass, then kill the bash loop
[10] Check /var/log/messages: the automount is unmounted and remounted several times during the test, typically once per timeout period (a consolidated reproducer sketch follows the Additional info section below)

Actual results:

The mount point sees an unmount event once the timeout hits, even though the network traffic and debug logs show that the mount point is being traversed.

Expected results:

The desire is to return to the prior RHEL 7.3 behavior, in which path traversal results in a timeout refresh.

Additional info:

The new behavior appears to have been introduced by the BZ and commit below:

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1413523
Upstream commit: linux.git 092a53452b
Depends: https://bugzilla.redhat.com/show_bug.cgi?id=1247935
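For convenience, steps [3]-[10] can be collapsed into a single script. This is only a minimal sketch, assuming the example map entries, server address (10.13.153.156) and 60-second timeout shown above, and that /etc/sysconfig/autofs has an uncommented OPTIONS= line; the exact expire message grepped at the end may differ between autofs versions.

    #!/bin/bash
    # Reproducer sketch for steps [3]-[10]; run as root on a fresh RHEL 7.4 host.
    set -e

    # [3] Direct map with a 60s timeout pointing at the example NFS export.
    echo '/- /etc/auto.nfs --timeout=60' >> /etc/auto.master
    echo '/nfs -noac 10.13.153.156:/mnt/export1/' > /etc/auto.nfs

    # [4] Turn automount debugging all the way up.
    sed -i 's/^OPTIONS=.*/OPTIONS="--debug"/' /etc/sysconfig/autofs

    # [5] Start (or restart) the autofs service.
    systemctl restart autofs

    # [6]/[7] Trigger the initial mount, then step back out of it.
    cd /nfs && cd /

    # [8]/[9] Query the mount point every 10s for three timeout periods (180s).
    end=$(( $(date +%s) + 180 ))
    while [ "$(date +%s)" -lt "$end" ]; do
        df /nfs/ >/dev/null
        sleep 10
    done

    # [10] With the regression, repeated expire/mount cycles show up in the log.
    grep -c 'expired /nfs' /var/log/messages || true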
Specifically related to sfdc case 01909702, the customer's usage and concerns are as follows:

In our case, we have thousands of clients, and thousands of automounts configured via LDAP. There are a few main areas of concern for me:

1) Caching. This applies to all of the below concerns: wouldn't unmounting the NFS file system cause the cache to be invalidated, and wouldn't this cause performance issues during each of the use cases I am about to describe?

2) /home directories. All of our /home are automounts. For regular users, this one is likely safe due to held open files and set working directories that would prevent the user's /home from being unmounted. The expiration logic probably doesn't trigger in this case, because the mount cannot be unmounted. However, build systems that have their working directory set to another location may still access content from /home such as .ssh/id_rsa or .bashrc, and these accesses would trigger the automounts. These automounts would be re-established at the automount expiration interval.

3) Tools warehouse. We have a number of NFS paths that store versions of tools. These tools are accessed without setting a current working directory or holding any file open for an extended period of time. Think of tools like particular versions of "gcc" or "make", but imagine thousands of these. Thousands of machines access these NFS paths, and these automounts will be re-established at the automount expiration interval.

4) Source code. These are NFS project spaces that build systems such as Bamboo, Jenkins, Team City, or other systems we have may access, without setting a current working directory or holding any file open for an extended period of time.

5) Build output. These are NFS project spaces that build systems such as Bamboo, Jenkins, Team City, or other systems we have may write build output to, often for sharing with other users or build systems, without setting a current working directory or holding any file open for an extended period of time.

6) Application data. Our application servers frequently use NFS for scalable storage. In many cases I have strongly encouraged the use of fixed mounts, as I have found automounts to be unnecessarily brittle in the past. However, application owners in our organization who may not be aware of this best practice still use automounts by default. Think of applications like JIRA or Confluence that store attachments on NFS. The attachments are accessed only for the duration required to read the file, and then they are closed. These automounts will be re-established at the automount expiration interval.

The reason we use automounts in the above cases is that, with the exception of 6), it can be much more overhead to publish fixed mounts on dozens or more machines that may need the resource. It is much easier to allow autofs to discover which machines need which mounts. Any NFS mounts that might be used by a dozen or more machines benefit from autofs; any NFS mounts used by just one or two machines can use fixed mounts instead. For our build systems, which have concerns 1) through 5) above, we might have 200 or more build machines, each with 20 or more automounts, each being re-established every 10 minutes. 200 x 20 = 4,000 mounts per 10 minutes. And this is just one of the scenarios.

I understand the dilemma. There are UI systems that have file dialogs, or file system monitoring systems that automatically discover new content, and these scan file systems just because they exist. But there are also legitimate use cases such as those I describe above, for which we would like "last_used" to be updated. Ian Kent documented this dilemma here:
(In reply to smazul from comment #0)

> Description of problem:
> Autofs will unmount within the specified timeout regardless of queries/stats of the mount point. Prior to RHEL 7.4, traversing a path would renew the timeout value.
>
> Additionally, the hope for this BZ is to gain additional details on the new behavior and to address some of the concerns raised in case 01909702.

Reading through the discussion below, I must say I'm not unsympathetic to the adverse effects of this change.

So that means there are two things to discuss: first, the reasons for the change, and secondly, what should be done about it.

snip ....

I don't think I need a reproducer; it's very likely this behaviour change is due to the fix for the original upstream regression regarding the last_used update. Specifically, as mentioned in the case:

- [fs] autofs: take more care to not update last_used on path walk (Ian Kent) [1413523]

> Actual results:
> The mount point sees an unmount event once the timeout hits, even though the network traffic and debug logs show that the mount point is being traversed.
>
> Expected results:
> The desire is to return to the prior RHEL 7.3 behavior, in which path traversal results in a timeout refresh.

That's almost how autofs is supposed to work.

A long time ago autofs adopted the strategy of only preventing expiry of mounts that are really, really in use, which meant that last_used was mostly not updated on path walks, specifically to avoid user space utilities, monitoring systems, etc. from preventing mount expiry.

But on each expire event the last_used field is updated if the mount is really in use, which means at least one process has a working directory in the file system or there are open files in the file system.

Due to upstream changes, probably by me, that regressed. In recent times this became a significant problem for me, to the extent that virtually all mounts that were *not* in use were being instantly re-mounted after being expired. This is what led me to discover the regression (at least a regression from my POV anyway).

> Additional info:
> The new behavior appears to have been introduced by the BZ and commit below:
>
> Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1413523

Yes, this is the one.

> Upstream commit: linux.git 092a53452b
> Depends: https://bugzilla.redhat.com/show_bug.cgi?id=1247935

This dependency was needed for a later patch series related to namespace handling (related to undesirable behaviours when using containers), so it doesn't really affect whether the problem patch is reverted or not. It's pretty much an isolated change, so reverting it should not have side effects.

> Specifically related to sfdc case 01909702, the customer's usage and concerns are as follows:
>
> In our case, we have thousands of clients, and thousands of automounts configured via LDAP. There are a few main areas of concern for me:
>
> 1) Caching. This applies to all of the below concerns: wouldn't unmounting the NFS file system cause the cache to be invalidated, and wouldn't this cause performance issues during each of the use cases I am about to describe?

The NFS attribute cache, or buffer cache, or CacheFS cache?

I think all of these would be invalidated.

I don't think invalidating the attribute cache would be a significant problem. The buffer cache, perhaps; the CacheFS cache more so.
> 2) /home directories. All of our /home are automounts. For regular users, this one is likely safe due to held open files and set working directories that would prevent the user's /home from being unmounted. The expiration logic probably doesn't trigger in this case, because the mount cannot be unmounted. However, build systems that have their working directory set to another location may still access content from /home such as .ssh/id_rsa or .bashrc, and these accesses would trigger the automounts. These automounts would be re-established at the automount expiration interval.

That's right, working directories or open files will cause the last_used field to be updated on expire events.

I'm not sure this is a terribly big issue as the umount to mount turnover shouldn't be that significant. Is it actually a big problem in your environment?

> 3) Tools warehouse. We have a number of NFS paths that store versions of tools. These tools are accessed without setting a current working directory or holding any file open for an extended period of time. Think of tools like particular versions of "gcc" or "make", but imagine thousands of these. Thousands of machines access these NFS paths, and these automounts will be re-established at the automount expiration interval.

Sure, common and sensible use of automounting. This would be a significant problem in large environments like yours, point well taken.

> 4) Source code. These are NFS project spaces that build systems such as Bamboo, Jenkins, Team City, or other systems we have may access, without setting a current working directory or holding any file open for an extended period of time.
>
> 5) Build output. These are NFS project spaces that build systems such as Bamboo, Jenkins, Team City, or other systems we have may write build output to, often for sharing with other users or build systems, without setting a current working directory or holding any file open for an extended period of time.

Same as point 3) I think, also point well taken.

> 6) Application data. Our application servers frequently use NFS for scalable storage. In many cases I have strongly encouraged the use of fixed mounts, as I have found automounts to be unnecessarily brittle in the past. However, application owners in our organization who may not be aware of this best practice still use automounts by default. Think of applications like JIRA or Confluence that store attachments on NFS. The attachments are accessed only for the duration required to read the file, and then they are closed. These automounts will be re-established at the automount expiration interval.
>
> The reason we use automounts in the above cases is that, with the exception of 6), it can be much more overhead to publish fixed mounts on dozens or more machines that may need the resource. It is much easier to allow autofs to discover which machines need which mounts. Any NFS mounts that might be used by a dozen or more machines benefit from autofs; any NFS mounts used by just one or two machines can use fixed mounts instead. For our build systems, which have concerns 1) through 5) above, we might have 200 or more build machines, each with 20 or more automounts, each being re-established every 10 minutes. 200 x 20 = 4,000 mounts per 10 minutes. And this is just one of the scenarios.

Sure, I understand this. Other use cases of these types are render farms and Geophysics data processing sites.
These types of environments tend to be smaller, but the problem is the same as yours.

Are the access patterns for the point 6) case that bad in terms of umount to mount turnover?

> I understand the dilemma. There are UI systems that have file dialogs, or file system monitoring systems that automatically discover new content, and these scan file systems just because they exist. But there are also legitimate use cases such as those I describe above, for which we would like "last_used" to be updated. Ian Kent documented this dilemma here:

Seems my quote is missing, ;)

But you are correct, I didn't consider very large environments in my anger at the user space changes that caused me so much recent pain. All I can do is apologize and try to work out some way to improve the large use problem.

In my defence, what I thought I was doing was resolving a regression to a behaviour that I established some years ago, when autofs version 5 was being developed, that I thought was the right way to do things.

What makes this problem hard is not deciding whether to revert this particular change; I can do that if we agree it needs to be done for RHEL-7 (and it sounds like we do need to).

The deeper problem is longer term in that it will just come back again in RHEL-8 if I can't find a smarter way to do this than just not updating the last_used field in certain cases, as is done now.

After all, if I revert this change upstream I'm pretty sure I'll get a bunch of bugs when RHEL-8 is released about mounts never expiring, which could easily affect quite a large number of customers with large and small environments.

Ian
For the purposes of the immediate next steps for this bug, we need to verify what has caused the change in behaviour.

I will check through all the patches in the series for bug 1413523 and review any that touched the last_used field. Once I have done that, I'll produce a test kernel with those changes reverted to check whether the behaviour change is in fact reverted.

Once this is done we can focus our attention on what should be done to fix the problem.

Ian
I think I have made the bug public. If the customer still can't see the comments we can add a suitable email address to the bug cc list which should do the trick.
One of the things I was looking into when I discovered this concern in RHEL 7.4 was high LDAP query volumes, many of which were coming from autofs. When an automount needs to be established, this not only has the overhead of performing the NFS protocol auto-negotiation (NFS v3 or NFS v4? ...) on the autofs side and then on the Linux side, but prior to this it requires LDAP queries, and prior to that it often needs to establish a TLS session with LDAP. All of this means that even if the underlying "mount" operation is fast, the end-to-end process here can take 2 or more seconds to complete. I don't want to get distracted by these details, as they are not fully relevant, but they provide some context for what the cost of automount expiration could be in a worst-case scenario.

> > 1) Caching. This applies to all of the below concerns: wouldn't unmounting the NFS file system cause the cache to be invalidated, and wouldn't this cause performance issues during each of the use cases I am about to describe?
>
> The NFS attribute cache, or buffer cache, or CacheFS cache?
>
> I think all of these would be invalidated.
>
> I don't think invalidating the attribute cache would be a significant problem.
>
> The buffer cache, perhaps; the CacheFS cache more so.

I am mostly thinking about the buffer cache.

When users or build services perform builds in NFS, this includes:

1) Establish the source workspace. This may be thousands to millions of source files or binary files. In some cases this can be incremental; however, it is a fairly common practice for small to medium workspaces to be clean extracts to ensure safe builds.

2) Build the source. This requires reading back many of these source files a little while later (often later than the 5 or so seconds that the NFS attribute cache is most effective for). Some files, like .h files, may be repeatedly read back.

3) Publish the build output. This requires reading back the generated files a little while later (often later than the 5 or so seconds that the NFS attribute cache is most effective for).

I believe you are correct that the attribute cache is not really applicable here, as it already expires frequently. The buffer cache, however, allows the most commonly used objects to be stored in RAM and made accessible after a relatively short attribute check to confirm that the buffer cache is still valid.

Similarly, for tools warehouse scenarios - there may be a large number of files required to publish a cross-compiler tool chain to NFS, and these may be called repeatedly. While they remain in the buffer cache, there is still the attribute query overhead but not the file read overhead. If the mounts expire frequently, even when the mount is in regular use, then these would need to be re-read into the buffer cache.

We don't use CacheFS today. I have wondered if we should. I think this would be a separate consideration and outside the scope of this report.

> I'm not sure this is a terribly big issue as the umount to mount turnover shouldn't be that significant.

For /home, I don't think it is much of an issue. But for the "time" question, I want to add some context: my concern isn't really the "time to mount", but the amount of extra overhead on the thousands of NFS clients, and dozens of NFS servers, that this issue potentially results in.

> Are the access patterns for the point 6) case that bad in terms of umount to mount turnover?
For point 6) specifically, I only mentioned it because I wanted to be complete about our use cases. I don't think point 6) would have that much impact in real life, as I expect the applications likely hold at least one file open (perhaps just a lock file!) that would prevent unmount and cause last_used to get updated.

> But you are correct, I didn't consider very large environments in my anger at the user space changes that caused me so much recent pain. All I can do is apologize and try to work out some way to improve the large use problem.
>
> In my defence, what I thought I was doing was resolving a regression to a behaviour that I established some years ago, when autofs version 5 was being developed, that I thought was the right way to do things.

Understood.

> The deeper problem is longer term in that it will just come back again in RHEL-8 if I can't find a smarter way to do this than just not updating the last_used field in certain cases, as is done now.
>
> After all, if I revert this change upstream I'm pretty sure I'll get a bunch of bugs when RHEL-8 is released about mounts never expiring, which could easily affect quite a large number of customers with large and small environments.

Yep, agree.

I think some aspect of this comes down to:

1) How to define "use"?
2) What abilities exist to detect "use"?

Having read about the issue you were trying to solve, I have a new appreciation for 1). Having recently reviewed some of the autofs and kernel code for the first time, I have an appreciation for 2).

I think there are at least three scenarios here to consider:

A) Applications which are overly aggressive about monitoring all file systems, even file systems which should not be their concern.
B) Applications which are correctly aggressive about monitoring file systems that should be their concern.
C) Applications which are actively using the file systems to perform a significant amount of work in the file system.

In RHEL 7.2 and RHEL 7.3, "use" seems to include at least:

- File handle open.
- Working directory set.
- Automount path traversed at least once.

In RHEL 7.4, "use" seems to be reduced to:

- File handle open.
- Working directory set.

I think C) from above is pretty easy to detect. I think repeated and persistent use of the file system easily differentiates C). In the "build" and "tools warehouse" cases, we would probably traverse the path thousands or millions of times during a 10-minute interval. This makes me think that you could choose a compromise here:

- File handle open.
- Working directory set.
- Automount path traversed at least N times.

I wonder, if you were to keep a "last_used" and a "last_used_count", would it be sufficient to confirm that "last_used_count > 100" or "last_used_count > 1000", and would this result in a decision that was often correct in terms of identifying passive monitoring use vs persistent active use?

As the expiration interval could have a large range depending upon use case, perhaps it would make sense to make N proportional to the expiration interval? For example:

- File handle open.
- Working directory set.
- Automount path traversed at least 10 times per minute.

(A toy sketch of this kind of decision follows this comment.)

Back to the three scenarios:

A) Applications which are overly aggressive about monitoring all file systems, even file systems which should not be their concern.
B) Applications which are correctly aggressive about monitoring file systems that should be their concern.
C) Applications which are actively using the file systems to perform a significant amount of work in the file system.

I think something like the above would help to more accurately identify C). Then, I am thinking that B) may not be important to distinguish from A) in real life, as the overhead should be low, and the use case may not be valid.

Anyways - just some ideas to continue the conversation.

Thanks for looking into this, Ian. It is much appreciated.
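To make the proposal above concrete, here is a minimal sketch of the suggested decision, assuming a hypothetical per-interval access counter (a "last_used_count" delta) and the map's expiry timeout were available to the expiry check. Neither the function nor its inputs exist in autofs today; this is illustration only, not proposed code.

    #!/bin/bash
    # Hypothetical illustration only: autofs does not expose these counters.
    # should_expire <walks_since_last_expire> <timeout_seconds>
    should_expire() {
        local walks=$1        # hypothetical last_used_count delta for the interval
        local timeout=$2      # map expiry timeout in seconds, e.g. 600
        # Proposal: treat the mount as "in use" if it was walked at least
        # ~10 times per minute of the expiry interval.
        local threshold=$(( timeout / 60 * 10 ))
        if [ "$walks" -ge "$threshold" ]; then
            echo "keep mounted (persistent active use)"
        else
            echo "expire (idle or monitoring-only access)"
        fi
    }

    should_expire 4200 600   # build host walking the path constantly -> keep mounted
    should_expire 3 600      # monitor stat'ing it occasionally -> expire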
(In reply to Mark Mielke from comment #8)

> One of the things I was looking into when I discovered this concern in RHEL 7.4 was high LDAP query volumes, many of which were coming from autofs. When an automount needs to be established, this not only has the overhead of performing the NFS protocol auto-negotiation (NFS v3 or NFS v4? ...) on the autofs side and then on the Linux side, but prior to this it requires LDAP queries, and prior to that it often needs to establish a TLS session with LDAP. All of this means that even if the underlying "mount" operation is fast, the end-to-end process here can take 2 or more seconds to complete. I don't want to get distracted by these details, as they are not fully relevant, but they provide some context for what the cost of automount expiration could be in a worst-case scenario.

Mmm ... yes, I'm aware of that.

This behaviour came about due to complaints about autofs maps needing to be always up to date without the need to issue a HUP signal (so it only applies to indirect mount maps). For remote autofs maps a query needs to be done on "every" lookup to try and work out if the map has changed. It isn't a problem for file maps as the modified date is easily checked. It does introduce additional overhead I would rather avoid.

I have considered doing something like the periodic server monitoring of the am-utils amd automounter instead of querying on every mount lookup, but that would make the proximity/availability calculation unreliable, so it's not high on the priorities list. Still, it might be adequate and would reduce the traffic somewhat.

> > > 1) Caching. This applies to all of the below concerns: wouldn't unmounting the NFS file system cause the cache to be invalidated, and wouldn't this cause performance issues during each of the use cases I am about to describe?
> >
> > The NFS attribute cache, or buffer cache, or CacheFS cache?
> >
> > I think all of these would be invalidated.
> >
> > I don't think invalidating the attribute cache would be a significant problem.
> >
> > The buffer cache, perhaps; the CacheFS cache more so.
>
> I am mostly thinking about the buffer cache.
>
> When users or build services perform builds in NFS, this includes:
>
> 1) Establish the source workspace. This may be thousands to millions of source files or binary files. In some cases this can be incremental; however, it is a fairly common practice for small to medium workspaces to be clean extracts to ensure safe builds.
>
> 2) Build the source. This requires reading back many of these source files a little while later (often later than the 5 or so seconds that the NFS attribute cache is most effective for). Some files, like .h files, may be repeatedly read back.
>
> 3) Publish the build output. This requires reading back the generated files a little while later (often later than the 5 or so seconds that the NFS attribute cache is most effective for).
>
> I believe you are correct that the attribute cache is not really applicable here, as it already expires frequently. The buffer cache, however, allows the most commonly used objects to be stored in RAM and made accessible after a relatively short attribute check to confirm that the buffer cache is still valid.

Yes, the NFS attributes are included in most NFS RPC replies so they are always being updated regardless of mounted status.
> Similarly, for tools warehouse scenarios - there may be a large number of files required to publish a cross-compiler tool chain to NFS, and these may be called repeatedly. While they remain in the buffer cache, there is still the attribute query overhead but not the file read overhead. If the mounts expire frequently, even when the mount is in regular use, then these would need to be re-read into the buffer cache.

Indeed, the buffer cache can make a big difference to IO. TBH I don't know much about the buffer cache other than the obvious significant effect it can have on reducing IOs.

> We don't use CacheFS today. I have wondered if we should. I think this would be a separate consideration and outside the scope of this report.

I also don't have much information about CacheFS, but I don't need to as the maintainer is part of our file systems group. It may be worth you following up with a discovery query at some point, as we do have specialist knowledge here at Red Hat.

> > I'm not sure this is a terribly big issue as the umount to mount turnover shouldn't be that significant.
>
> For /home, I don't think it is much of an issue. But for the "time" question, I want to add some context: my concern isn't really the "time to mount", but the amount of extra overhead on the thousands of NFS clients, and dozens of NFS servers, that this issue potentially results in.
>
> > Are the access patterns for the point 6) case that bad in terms of umount to mount turnover?
>
> For point 6) specifically, I only mentioned it because I wanted to be complete about our use cases. I don't think point 6) would have that much impact in real life, as I expect the applications likely hold at least one file open (perhaps just a lock file!) that would prevent unmount and cause last_used to get updated.

Right, it sounds like I'm sufficiently up with this usage pattern.

> > But you are correct, I didn't consider very large environments in my anger at the user space changes that caused me so much recent pain. All I can do is apologize and try to work out some way to improve the large use problem.
> >
> > In my defence, what I thought I was doing was resolving a regression to a behaviour that I established some years ago, when autofs version 5 was being developed, that I thought was the right way to do things.
>
> Understood.
>
> > The deeper problem is longer term in that it will just come back again in RHEL-8 if I can't find a smarter way to do this than just not updating the last_used field in certain cases, as is done now.
> >
> > After all, if I revert this change upstream I'm pretty sure I'll get a bunch of bugs when RHEL-8 is released about mounts never expiring, which could easily affect quite a large number of customers with large and small environments.
>
> Yep, agree.
>
> I think some aspect of this comes down to:
>
> 1) How to define "use"?
> 2) What abilities exist to detect "use"?

I think it's time to modify the policy of "in use" that guides changes like this.

It is hard for me to keep in mind large site use cases, as I moved out of the data centre and into development and support of autofs many years ago now. Nevertheless, the larger the environment the more benefit autofs can be, and while a number of autofs users don't quite get the large site case, it is and should be the target use case IMHO.

> Having read about the issue you were trying to solve, I have a new appreciation for 1).
> Having recently reviewed some of the autofs and kernel code for the first time, I have an appreciation for 2).
>
> I think there are at least three scenarios here to consider:
>
> A) Applications which are overly aggressive about monitoring all file systems, even file systems which should not be their concern.
> B) Applications which are correctly aggressive about monitoring file systems that should be their concern.
> C) Applications which are actively using the file systems to perform a significant amount of work in the file system.
>
> In RHEL 7.2 and RHEL 7.3, "use" seems to include at least:
>
> - File handle open.
> - Working directory set.
> - Automount path traversed at least once.
>
> In RHEL 7.4, "use" seems to be reduced to:
>
> - File handle open.
> - Working directory set.
>
> I think C) from above is pretty easy to detect. I think repeated and persistent use of the file system easily differentiates C). In the "build" and "tools warehouse" cases, we would probably traverse the path thousands or millions of times during a 10-minute interval. This makes me think that you could choose a compromise here:
>
> - File handle open.
> - Working directory set.
> - Automount path traversed at least N times.
>
> I wonder, if you were to keep a "last_used" and a "last_used_count", would it be sufficient to confirm that "last_used_count > 100" or "last_used_count > 1000", and would this result in a decision that was often correct in terms of identifying passive monitoring use vs persistent active use?
>
> As the expiration interval could have a large range depending upon use case, perhaps it would make sense to make N proportional to the expiration interval? For example:
>
> - File handle open.
> - Working directory set.
> - Automount path traversed at least 10 times per minute.
>
> Back to the three scenarios:
>
> A) Applications which are overly aggressive about monitoring all file systems, even file systems which should not be their concern.
> B) Applications which are correctly aggressive about monitoring file systems that should be their concern.
> C) Applications which are actively using the file systems to perform a significant amount of work in the file system.
>
> I think something like the above would help to more accurately identify C). Then, I am thinking that B) may not be important to distinguish from A) in real life, as the overhead should be low, and the use case may not be valid.

Yes, I'm thinking along the lines of some sort of frequency calculation. It would need to be very simple, so as not to add too much overhead to path lookups, and ideally independent of the expire time setting. Still not sure yet.

The reality is it was like this at RHEL-7 GA and we don't have the problem (to a large degree, although it is there) in RHEL-7, so there shouldn't be problems with reverting this one change, assuming it really is the change I think it is.

There were two patches in the RHEL-7 sync with upstream that touch last_used. One was part of an upstream path walk series that tries to optimize path walks for certain work loads (one work load example was application building). I don't think that patch has any effect on what we are seeing here, because it only moved the update out of a frequently called function to a location that updates last_used once at completion of the procedure. Only a small part of the overall optimization, but still part of it.
Anyway, I'm just getting back to this to build a test kernel with the patch I think changed this reverted, so hopefully I'll have something for you to test (your) tomorrow, LOL, or perhaps your today depending on when you see this.

> Anyways - just some ideas to continue the conversation.
>
> Thanks for looking into this, Ian. It is much appreciated.

I appreciate your effort in evaluating and describing your use case. It's very important in keeping me focused on what's most important for autofs and, TBH, I really don't get enough of it, which can lead to not so good decisions on my part.

Ian
Mmm ... that took a lot longer than it should have.

I have a test build with the recent last_used modifier patch reverted. It can be found at:
http://people.redhat.com/~ikent/kernel-3.10.0-693.2.2.el7.bz1489542.1/

This build isn't a release build and so it is not signed; it's for testing only.

Please check whether this kernel resolves the expiration problem we have discussed here in this bug.

Ian
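For reference, one way to install a scratch kernel like this for testing; the exact RPM file name under the URL above is an assumption (it is not listed in this bug), so check the directory listing and adjust accordingly.

    # Assumed RPM name; the build is unsigned, hence --nogpgcheck.
    yum install -y --nogpgcheck \
        http://people.redhat.com/~ikent/kernel-3.10.0-693.2.2.el7.bz1489542.1/kernel-3.10.0-693.2.2.el7.bz1489542.1.x86_64.rpm
    reboot
    # After reboot, confirm the test kernel is running:
    uname -r    # expect 3.10.0-693.2.2.el7.bz1489542.1.x86_64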
I tried out your test kernel, and I am unable to reproduce the problem that I reported. Absolute path accesses are holding the mount as "used" as expected, and the same as RHEL 7.3 and prior.
(In reply to Mark Mielke from comment #11)
> I tried out your test kernel, and I am unable to reproduce the problem that I reported. Absolute path accesses are holding the mount as "used" as expected, and the same as RHEL 7.3 and prior.

Ok, thought that would be the case. I'll go ahead with the process to revert the change.
Created attachment 1328172 [details] Patch - revert: take more care to not update last_used on path walk
Created attachment 1343778 [details]
Trivial testcase showing the regression and fix.

Passes on RHEL7.3 kernel:

# ./test.sh
Thu Oct 26 09:52:03 EDT 2017: System Configuration
3.10.0-514.6.1.el7.x86_64
autofs-5.0.7-56.el7.x86_64
/net -hosts --timeout=60
Thu Oct 26 09:52:03 EDT 2017: first access - initiates mount request
grep rhel7u4-node2 /proc/mounts
-hosts /net/rhel7u4-node2/exports autofs rw,relatime,fd=12,pgrp=23643,timeout=60,minproto=5,maxproto=5,offset 0 0
rhel7u4-node2:/exports /net/rhel7u4-node2/exports nfs4 rw,nosuid,nodev,relatime,vers=4.0,rsize=131072,wsize=131072,namlen=255,hard,proto=tcp,port=0,timeo=600,retrans=2,sec=sys,clientaddr=192.168.122.6,local_lock=none,addr=192.168.122.8 0 0
Thu Oct 26 09:52:03 EDT 2017: sleep 30 seconds == 1/2 of expiry timer
Thu Oct 26 09:52:33 EDT 2017: check for presence of mount
grep rhel7u4-node2 /proc/mounts
-hosts /net/rhel7u4-node2/exports autofs rw,relatime,fd=12,pgrp=23643,timeout=60,minproto=5,maxproto=5,offset 0 0
rhel7u4-node2:/exports /net/rhel7u4-node2/exports nfs4 rw,nosuid,nodev,relatime,vers=4.0,rsize=131072,wsize=131072,namlen=255,hard,proto=tcp,port=0,timeo=600,retrans=2,sec=sys,clientaddr=192.168.122.6,local_lock=none,addr=192.168.122.8 0 0
Thu Oct 26 09:52:33 EDT 2017: second access - initiates path walk but do not open any files
grep rhel7u4-node2 /proc/mounts
-hosts /net/rhel7u4-node2/exports autofs rw,relatime,fd=12,pgrp=23643,timeout=60,minproto=5,maxproto=5,offset 0 0
rhel7u4-node2:/exports /net/rhel7u4-node2/exports nfs4 rw,nosuid,nodev,relatime,vers=4.0,rsize=131072,wsize=131072,namlen=255,hard,proto=tcp,port=0,timeo=600,retrans=2,sec=sys,clientaddr=192.168.122.6,local_lock=none,addr=192.168.122.8 0 0
Thu Oct 26 09:52:33 EDT 2017: sleep 45 seconds == 1/2 of expiry timer + 15 seconds
Thu Oct 26 09:53:18 EDT 2017: check for presence of mount - will be present if path walk via ls command reset timer
TEST PASS: autofs expiry timer reset on path walk via ls command

Fails with RHEL7.4 kernel:

# ./test.sh
Thu Oct 26 09:54:46 EDT 2017: System Configuration
3.10.0-693.el7.x86_64
autofs-5.0.7-69.el7.x86_64
/net -hosts --timeout=60
Thu Oct 26 09:54:46 EDT 2017: first access - initiates mount request
grep rhel7u4-node2 /proc/mounts
-hosts /net/rhel7u4-node2/exports autofs rw,relatime,fd=13,pgrp=23857,timeout=60,minproto=5,maxproto=5,offset,pipe_ino=128489 0 0
rhel7u4-node2:/exports /net/rhel7u4-node2/exports nfs4 rw,nosuid,nodev,relatime,vers=4.0,rsize=131072,wsize=131072,namlen=255,hard,proto=tcp,port=0,timeo=600,retrans=2,sec=sys,clientaddr=192.168.122.32,local_lock=none,addr=192.168.122.8 0 0
Thu Oct 26 09:54:46 EDT 2017: sleep 30 seconds == 1/2 of expiry timer
Thu Oct 26 09:55:16 EDT 2017: check for presence of mount
grep rhel7u4-node2 /proc/mounts
-hosts /net/rhel7u4-node2/exports autofs rw,relatime,fd=13,pgrp=23857,timeout=60,minproto=5,maxproto=5,offset,pipe_ino=128489 0 0
rhel7u4-node2:/exports /net/rhel7u4-node2/exports nfs4 rw,nosuid,nodev,relatime,vers=4.0,rsize=131072,wsize=131072,namlen=255,hard,proto=tcp,port=0,timeo=600,retrans=2,sec=sys,clientaddr=192.168.122.32,local_lock=none,addr=192.168.122.8 0 0
Thu Oct 26 09:55:17 EDT 2017: second access - initiates path walk but do not open any files
grep rhel7u4-node2 /proc/mounts
-hosts /net/rhel7u4-node2/exports autofs rw,relatime,fd=13,pgrp=23857,timeout=60,minproto=5,maxproto=5,offset,pipe_ino=128489 0 0
rhel7u4-node2:/exports /net/rhel7u4-node2/exports nfs4 rw,nosuid,nodev,relatime,vers=4.0,rsize=131072,wsize=131072,namlen=255,hard,proto=tcp,port=0,timeo=600,retrans=2,sec=sys,clientaddr=192.168.122.32,local_lock=none,addr=192.168.122.8 0 0
Thu Oct 26 09:55:17 EDT 2017: sleep 45 seconds == 1/2 of expiry timer + 15 seconds
Thu Oct 26 09:56:02 EDT 2017: check for presence of mount - will be present if path walk via ls command reset timer
TEST FAIL: autofs expiry timer not reset on path walk via ls command

Passes again with the test kernel from https://bugzilla.redhat.com/show_bug.cgi?id=1489542#c10:

# ./test.sh
Thu Oct 26 10:01:28 EDT 2017: System Configuration
3.10.0-693.2.2.el7.bz1489542.1.x86_64
autofs-5.0.7-69.el7.x86_64
/net -hosts --timeout=60
Thu Oct 26 10:01:28 EDT 2017: first access - initiates mount request
grep rhel7u4-node2 /proc/mounts
-hosts /net/rhel7u4-node2/exports autofs rw,relatime,fd=13,pgrp=999,timeout=60,minproto=5,maxproto=5,offset,pipe_ino=17809 0 0
rhel7u4-node2:/exports /net/rhel7u4-node2/exports nfs4 rw,nosuid,nodev,relatime,vers=4.0,rsize=131072,wsize=131072,namlen=255,hard,proto=tcp,port=0,timeo=600,retrans=2,sec=sys,clientaddr=192.168.122.32,local_lock=none,addr=192.168.122.8 0 0
Thu Oct 26 10:01:29 EDT 2017: sleep 30 seconds == 1/2 of expiry timer
Thu Oct 26 10:01:59 EDT 2017: check for presence of mount
grep rhel7u4-node2 /proc/mounts
-hosts /net/rhel7u4-node2/exports autofs rw,relatime,fd=13,pgrp=999,timeout=60,minproto=5,maxproto=5,offset,pipe_ino=17809 0 0
rhel7u4-node2:/exports /net/rhel7u4-node2/exports nfs4 rw,nosuid,nodev,relatime,vers=4.0,rsize=131072,wsize=131072,namlen=255,hard,proto=tcp,port=0,timeo=600,retrans=2,sec=sys,clientaddr=192.168.122.32,local_lock=none,addr=192.168.122.8 0 0
Thu Oct 26 10:01:59 EDT 2017: second access - initiates path walk but do not open any files
grep rhel7u4-node2 /proc/mounts
-hosts /net/rhel7u4-node2/exports autofs rw,relatime,fd=13,pgrp=999,timeout=60,minproto=5,maxproto=5,offset,pipe_ino=17809 0 0
rhel7u4-node2:/exports /net/rhel7u4-node2/exports nfs4 rw,nosuid,nodev,relatime,vers=4.0,rsize=131072,wsize=131072,namlen=255,hard,proto=tcp,port=0,timeo=600,retrans=2,sec=sys,clientaddr=192.168.122.32,local_lock=none,addr=192.168.122.8 0 0
Thu Oct 26 10:01:59 EDT 2017: sleep 45 seconds == 1/2 of expiry timer + 15 seconds
Thu Oct 26 10:02:44 EDT 2017: check for presence of mount - will be present if path walk via ls command reset timer
TEST PASS: autofs expiry timer reset on path walk via ls command
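The attached test.sh itself is not reproduced in this bug. Reconstructed from the output above, a roughly equivalent sketch might look like the following; the host name and export path come from that output, while the exact commands in the real attachment may differ.

    #!/bin/bash
    # Sketch of the attached testcase: does an ls-only path walk half way
    # through the 60s expiry timer reset the timer?
    HOST=rhel7u4-node2
    MNT=/net/$HOST/exports

    log() { echo "$(date): $*"; }

    log "System Configuration"
    uname -r
    rpm -q autofs
    grep ^/net /etc/auto.master

    log "first access - initiates mount request"
    ls "$MNT" >/dev/null
    grep $HOST /proc/mounts

    log "sleep 30 seconds == 1/2 of expiry timer"
    sleep 30
    log "check for presence of mount"
    grep $HOST /proc/mounts

    log "second access - initiates path walk but do not open any files"
    ls "$MNT" >/dev/null
    grep $HOST /proc/mounts

    log "sleep 45 seconds == 1/2 of expiry timer + 15 seconds"
    sleep 45
    log "check for presence of mount - will be present if path walk via ls command reset timer"
    if grep -q "$HOST:/exports" /proc/mounts; then
        echo "TEST PASS: autofs expiry timer reset on path walk via ls command"
    else
        echo "TEST FAIL: autofs expiry timer not reset on path walk via ls command"
    fi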
Is the kbase article protected on purpose? Even when logged in with my user I can't view it.
(In reply to Klaas Demter from comment #22)
> Is the kbase article protected on purpose? Even when logged in with my user I can't view it.

Sorry, it is incomplete so not published yet. I will publish it soon.
Patch(es) committed on kernel repository and an interim kernel build is undergoing testing
Patch(es) available on kernel-3.10.0-822.el7
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2018:1062