From Bugzilla Helper:
User-Agent: Mozilla/5.0 (Macintosh; U; PPC Mac OS X Mach-O; en-US; rv:1.7.8) Gecko/20050511 Firefox/1.0.4

Description of problem:
I have a cluster with two partitions, /home/cw and /misc/data_kzm, that are
automounted on all nodes in the cluster. There is a symbolic link,
/home/cw/DATA3D, that points to /misc/data_kzm/cw/DATA3D. Most of the time
this link works without a problem. When I run code that writes to
/home/cw/DATA3D, however, it crashes, complaining that there is no such file
or directory. If I log in to the node and run ls -l /misc/data_kzm/cw,
DATA3D shows up in flashing red as if it were a broken link, even though the
rest of the partition is mounted and behaving normally. Furthermore, I can
avoid the whole problem by logging in to the node before the job runs and
entering /misc/data_kzm/cw/DATA3D. We didn't have this problem when we were
running RedHat 3.3; it has only shown up since we upgraded to 4.0 (with
autofs-4.1.3-155).

Version-Release number of selected component (if applicable):
autofs-4.1.3-155, 2.6.9-11.ELsmp

How reproducible:
Always

Steps to Reproduce:
1. umount all autofs-mounted partitions
2. run code that writes to /home/cw/DATA3D
3.

Actual Results:
End up with a directory that is unreadable, even though the rest of the
partition is:

ls -l /misc/data_kzm/cw/
total 16
?---------  ? ?  ?     ?            ? DATA3D
drwxrwxr-x  2 cw cw 4096 Nov 29 14:52 OBJ
drwxrwxr-x  2 cw cw 4096 Dec  1 18:00 RESTART

Additional info:

auto.master:
# $411id: /etc/auto/master$
# Retrieved: 02-Dec-2005 09:06
# Master server: 10.1.1.1
# Last modified on master: 18-Nov-2005 17:34
# Encrypted file size: 417 bytes
#
# Owner: 0.0
# Name: /etc/auto.master
# Mode: 0100444
/home   /etc/auto.home  --timeout=600
/misc   /etc/auto.misc  --timeout=600 --debug

auto.misc:
# $411id: /etc/auto/misc$
# Retrieved: 02-Dec-2005 09:06
# Master server: 10.1.1.1
# Last modified on master: 01-Dec-2005 14:17
# Encrypted file size: 1.2K bytes
#
# Owner: 0.0
# Name: /etc/auto.misc
# Mode: 0100644
#
# $Id: auto.misc,v 1.2 2003/09/29 08:22:35 raven Exp $
#
# This is an automounter map and it has the following format
# key [ -mount-options-separated-by-comma ] location
# Details may be found in the autofs(5) manpage
cd          -fstype=iso9660,ro,nosuid,nodev :/dev/cdrom
data_eli    10.1.1.162:/data_eli
data_kzm    10.1.1.162:/data_kzm

from /var/log/debug:
Dec 2 10:23:59 compute-3-92 automount[2936]: starting automounter version 4.1.3-155, path = /misc, maptype = file, mapname = /etc/auto.misc
Dec 2 10:23:59 compute-3-92 automount[2936]: mount(bind): bind_works = 1
Dec 2 10:23:59 compute-3-92 automount[2936]: using kernel protocol version 4.05
Dec 2 10:23:59 compute-3-92 automount[2936]: using timeout 600 seconds; freq 150 secs
Dec 2 10:25:09 compute-3-92 automount[2936]: handle_packet: type = 0
Dec 2 10:25:09 compute-3-92 automount[2936]: handle_packet_missing: token 133, name data_kzm
Dec 2 10:25:09 compute-3-92 automount[2936]: attempting to mount entry /misc/data_kzm
Dec 2 10:25:09 compute-3-92 automount[3007]: lookup(file): data_kzm -> 10.1.1.162:/data_kzm
Dec 2 10:25:09 compute-3-92 automount[3007]: parse(sun): expanded entry: 10.1.1.162:/data_kzm
Dec 2 10:25:09 compute-3-92 automount[3007]: parse(sun): gathered options:
Dec 2 10:25:09 compute-3-92 automount[3007]: parse(sun): dequote("10.1.1.162:/data_kzm") -> 10.1.1.162:/data_kzm
Dec 2 10:25:09 compute-3-92 automount[3007]: parse(sun): core of entry: options=, loc=10.1.1.162:/data_kzm
Dec 2 10:25:09 compute-3-92 automount[3007]: parse(sun): mounting root /misc, mount point data_kzm, what 10.1.1.162:/data_kzm, fstype nfs, options <NULL>
Dec 2 10:25:09 compute-3-92 automount[3007]: mount(nfs): root=/misc name=data_kzm what=10.1.1.162:/data_kzm, fstype=nfs, options=<NULL>
Dec 2 10:25:09 compute-3-92 automount[3007]: mount(nfs): is_bad_host: 10.1.1.162:/data_kzm
Dec 2 10:25:09 compute-3-92 automount[3007]: mount(nfs): is_local_mount: 10.1.1.162:/data_kzm
Dec 2 10:25:09 compute-3-92 automount[3007]: mount(nfs): from 10.1.1.162:/data_kzm elected 10.1.1.162:/data_kzm
Dec 2 10:25:09 compute-3-92 automount[3007]: mount(nfs): calling mkdir_path /misc/data_kzm
Dec 2 10:25:09 compute-3-92 automount[3007]: mount(nfs): calling mount -t nfs 10.1.1.162:/data_kzm /misc/data_kzm
Dec 2 10:25:09 compute-3-92 automount[3007]: mount(nfs): mounted 10.1.1.162:/data_kzm on /misc/data_kzm
Dec 2 10:25:10 compute-3-92 automount[2936]: handle_child: got pid 3007, sig 0 (0), stat 0
Dec 2 10:25:10 compute-3-92 automount[2936]: sig_child: found pending iop pid 3007: signalled 0 (sig 0), exit status 0
Dec 2 10:25:10 compute-3-92 automount[2936]: send_ready: token=133
Dec 2 10:27:55 compute-3-92 automount[2936]: sig 14 switching from 1 to 2
Dec 2 10:27:55 compute-3-92 automount[2936]: get_pkt: state 1, next 2
Dec 2 10:27:55 compute-3-92 automount[2936]: st_expire(): state = 1
Dec 2 10:27:55 compute-3-92 automount[2936]: expire_proc: exp_proc=3041
Dec 2 10:27:55 compute-3-92 automount[3041]: expire_proc: 1 remaining in /misc
Dec 2 10:27:55 compute-3-92 automount[2936]: handle_child: got pid 3041, sig 0 (0), stat 1
Dec 2 10:27:55 compute-3-92 automount[2936]: sigchld: exp 3041 finished, switching from 2 to 1
Dec 2 10:27:55 compute-3-92 automount[2936]: get_pkt: state 2, next 1
Dec 2 10:27:55 compute-3-92 automount[2936]: st_ready(): state = 2
Hmm, the log shows only a successful mount. Did this run include the "No such file or directory" error? In other words, did you trigger the problem while debugging was enabled?
I think that it did, but just in case, I've run it again to make sure. This
definitely gives the "No such file or directory" error, and when I ls -l
/misc/data_kzm/cw, I get the following:

total 36
drwxrwxr-x  2 cw cw 4096 Dec  2 15:16 RESTART
drwxrwxr-x  2 cw cw 4096 Nov 29 14:52 OBJ
drwxrwxr-x  5 cw cw 8192 Nov 16 17:02 ..
drwxrwxr-x  5 cw cw 4096 Nov 15 17:29 .
?---------  ? ?  ?     ?            ? DATA3D

while df -h gives

[root@compute-3-60 ~]# df -h
Filesystem                    Size  Used Avail Use% Mounted on
/dev/sda1                     9.7G  2.3G  6.9G  25% /
none                         1005M     0 1005M   0% /dev/shm
/dev/sda3                      63G   33M   60G   1% /state/partition1
swell.local:/export/home/cw   326G  247G   63G  80% /home/cw
10.1.1.162:/data_kzm          2.7T  2.1T  561G  79% /misc/data_kzm

Here's the log from this run:

Dec 2 15:44:44 compute-3-60 automount[19389]: starting automounter version 4.1.3-155, path = /misc, maptype = file, mapname = /etc/auto.misc
Dec 2 15:44:44 compute-3-60 automount[19389]: mount(bind): bind_works = 1
Dec 2 15:44:44 compute-3-60 automount[19389]: using kernel protocol version 4.05
Dec 2 15:44:44 compute-3-60 automount[19389]: using timeout 600 seconds; freq 150 secs
Dec 2 15:46:04 compute-3-60 automount[19389]: handle_packet: type = 0
Dec 2 15:46:04 compute-3-60 automount[19389]: handle_packet_missing: token 111, name data_kzm
Dec 2 15:46:04 compute-3-60 automount[19389]: attempting to mount entry /misc/data_kzm
Dec 2 15:46:04 compute-3-60 automount[19455]: lookup(file): data_kzm -> 10.1.1.162:/data_kzm
Dec 2 15:46:04 compute-3-60 automount[19455]: parse(sun): expanded entry: 10.1.1.162:/data_kzm
Dec 2 15:46:04 compute-3-60 automount[19455]: parse(sun): gathered options:
Dec 2 15:46:04 compute-3-60 automount[19455]: parse(sun): dequote("10.1.1.162:/data_kzm") -> 10.1.1.162:/data_kzm
Dec 2 15:46:04 compute-3-60 automount[19455]: parse(sun): core of entry: options=, loc=10.1.1.162:/data_kzm
Dec 2 15:46:04 compute-3-60 automount[19455]: parse(sun): mounting root /misc, mountpoint data_kzm, what 10.1.1.162:/data_kzm, fstype nfs, options <NULL>
Dec 2 15:46:04 compute-3-60 automount[19455]: mount(nfs): root=/misc name=data_kzm what=10.1.1.162:/data_kzm, fstype=nfs, options=<NULL>
Dec 2 15:46:04 compute-3-60 automount[19455]: mount(nfs): is_bad_host: 10.1.1.162:/data_kzm
Dec 2 15:46:04 compute-3-60 automount[19455]: mount(nfs): is_local_mount: 10.1.1.162:/data_kzm
Dec 2 15:46:04 compute-3-60 automount[19455]: mount(nfs): from 10.1.1.162:/data_kzm elected 10.1.1.162:/data_kzm
Dec 2 15:46:04 compute-3-60 automount[19455]: mount(nfs): calling mkdir_path /misc/data_kzm
Dec 2 15:46:04 compute-3-60 automount[19455]: mount(nfs): calling mount -t nfs 10.1.1.162:/data_kzm /misc/data_kzm
Dec 2 15:46:04 compute-3-60 automount[19455]: mount(nfs): mounted 10.1.1.162:/data_kzm on /misc/data_kzm
Dec 2 15:46:04 compute-3-60 automount[19389]: handle_child: got pid 19455, sig 0 (0), stat 0
Dec 2 15:46:04 compute-3-60 automount[19389]: sig_child: found pending iop pid 19455: signalled 0 (sig 0), exit status 0
Dec 2 15:46:04 compute-3-60 automount[19389]: send_ready: token=111
I sent the message too quickly... more appeared shortly after:

Dec 2 15:47:53 compute-3-60 automount[19389]: sig 14 switching from 1 to 2
Dec 2 15:47:53 compute-3-60 automount[19389]: get_pkt: state 1, next 2
Dec 2 15:47:53 compute-3-60 automount[19389]: st_expire(): state = 1
Dec 2 15:47:53 compute-3-60 automount[19471]: expire_proc: 1 remaining in /misc
Dec 2 15:47:53 compute-3-60 automount[19389]: expire_proc: exp_proc=19471
Dec 2 15:47:53 compute-3-60 automount[19389]: handle_child: got pid 19471, sig 0 (0), stat 1
Dec 2 15:47:53 compute-3-60 automount[19389]: sigchld: exp 19471 finished, switching from 2 to 1
Dec 2 15:47:53 compute-3-60 automount[19389]: get_pkt: state 2, next 1
Dec 2 15:47:53 compute-3-60 automount[19389]: st_ready(): state = 2
Dec 2 15:50:23 compute-3-60 automount[19389]: sig 14 switching from 1 to 2
Dec 2 15:50:23 compute-3-60 automount[19389]: get_pkt: state 1, next 2
Dec 2 15:50:23 compute-3-60 automount[19389]: st_expire(): state = 1
Dec 2 15:50:23 compute-3-60 automount[19389]: expire_proc: exp_proc=19491
Dec 2 15:50:23 compute-3-60 automount[19491]: expire_proc: 1 remaining in /misc
Dec 2 15:50:23 compute-3-60 automount[19389]: handle_child: got pid 19491, sig 0 (0), stat 1
Dec 2 15:50:23 compute-3-60 automount[19389]: sigchld: exp 19491 finished, switching from 2 to 1
Dec 2 15:50:23 compute-3-60 automount[19389]: get_pkt: state 2, next 1
Dec 2 15:50:23 compute-3-60 automount[19389]: st_ready(): state = 2
Dec 2 15:52:53 compute-3-60 automount[19389]: sig 14 switching from 1 to 2
Dec 2 15:52:53 compute-3-60 automount[19389]: get_pkt: state 1, next 2
Dec 2 15:52:53 compute-3-60 automount[19389]: st_expire(): state = 1
Dec 2 15:52:53 compute-3-60 automount[19389]: expire_proc: exp_proc=19504
Dec 2 15:52:53 compute-3-60 automount[19504]: expire_proc: 1 remaining in /misc
Dec 2 15:52:53 compute-3-60 automount[19389]: handle_child: got pid 19504, sig 0 (0), stat 1
Dec 2 15:52:53 compute-3-60 automount[19389]: sigchld: exp 19504 finished, switching from 2 to 1
Dec 2 15:52:53 compute-3-60 automount[19389]: get_pkt: state 2, next 1
Dec 2 15:52:53 compute-3-60 automount[19389]: st_ready(): state = 2
Dec 2 15:55:23 compute-3-60 automount[19389]: sig 14 switching from 1 to 2
Dec 2 15:55:23 compute-3-60 automount[19389]: get_pkt: state 1, next 2
Dec 2 15:55:23 compute-3-60 automount[19389]: st_expire(): state = 1
Dec 2 15:55:23 compute-3-60 automount[19389]: expire_proc: exp_proc=19522
Dec 2 15:55:23 compute-3-60 automount[19522]: expire_proc: 1 remaining in /misc
Dec 2 15:55:23 compute-3-60 automount[19389]: handle_child: got pid 19522, sig 0 (0), stat 1
Dec 2 15:55:23 compute-3-60 automount[19389]: sigchld: exp 19522 finished, switching from 2 to 1
Dec 2 15:55:23 compute-3-60 automount[19389]: get_pkt: state 2, next 1
Dec 2 15:55:23 compute-3-60 automount[19389]: st_ready(): state = 2
Hello, Is this issue still receiving attention? Is there a more appropriate forum for this particular problem?
As stated on the bugzilla main page: "If you are a Red Hat Enterprise Linux customer and have an active support entitlement, you can log in to Red Hat Support for assistance with your issue." Aside from that, yes, this is the right place to file a bug. Another place to look for help is the autofs mailing list: autofs.org. This issue is on my radar, but there are several things which are higher priority at the moment.
Ian, does this stir up any memories?
I think in order to proceed on this, we'll need to see some debugging output
from the kernel side. Are you capable of building your own kernels? If so,
modify fs/autofs4/autofs_i.h to enable debugging (uncomment the line that says
#define DEBUG), then rebuild the kernel.

If you are not comfortable building your own kernels, then please let me know
what architecture and kernel version you are using (e.g. i686 smp). The output
of uname -a will have enough information for me to derive the version.

Once we get you a test kernel, I'll need you to reproduce the problem. I tried
to reproduce it in-house, but was unable to. I'll need the debug output from
the kernel module and the user-space daemon together. All of this should end
up in the debug log you've set up.
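For reference, this is the sort of one-line change being asked for -- a sketch
only, since the exact surrounding lines in fs/autofs4/autofs_i.h vary between
kernel versions:

/*
 * fs/autofs4/autofs_i.h (sketch): uncommenting DEBUG turns the DPRINTK()
 * calls throughout the autofs4 module into printk() output in the kernel
 * log, which is the kernel-side debug output we need captured here.
 */
#define DEBUG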
I'm still waiting for a response to comment #7.
I still can't reproduce this locally. Can you try with an updated kernel? We did fix some corner cases there that may or may not apply here. Thanks!
I have not been able to reproduce the broken metadata shown in the ls -l
output from the original problem report. However, I was able to uncover a bug
in the autofs kernel code dealing with concurrent expire and mount requests.

If a mount request comes in for a file system that is currently being expired,
then it blocks in try_to_fill_dentry, here:

	/* Block on any pending expiry here; invalidate the dentry
	   when expiration is done to trigger mount request with a new dentry */
	if (ino && (ino->flags & AUTOFS_INF_EXPIRING)) {
		DPRINTK("waiting for expire %p name=%.*s",
			 dentry, dentry->d_name.len, dentry->d_name.name);

		status = autofs4_wait(sbi, dentry, NFY_NONE);

		DPRINTK("expire done status=%d", status);

		/*
		 * If the directory still exists the mount request must
		 * continue otherwise it can't be followed at the right
		 * time during the walk.
		 */
		status = d_invalidate(dentry);
		if (status != -EBUSY)
			return -ENOENT;

The top-most comment implies that we want to return a status code to the
caller that will cause it to retry the lookup. However, returning -ENOENT, as
is done here, will not elicit the desired behaviour from the VFS. The reason
is that autofs4_revalidate simply passes the return code along to its caller,
do_revalidate:

static inline struct dentry *do_revalidate(struct dentry *dentry, struct nameidata *nd)
{
	int status = dentry->d_op->d_revalidate(dentry, nd);
	if (unlikely(status <= 0)) {
		/*
		 * The dentry failed validation.
		 * If d_revalidate returned 0 attempt to invalidate
		 * the dentry otherwise d_revalidate is asking us
		 * to return a fail status.
		 */
		if (!status) {
			if (!d_invalidate(dentry)) {
				dput(dentry);
				dentry = NULL;
			}
		} else {
			dput(dentry);
			dentry = ERR_PTR(status);
		}
	}
	return dentry;
}

Notice that status will be -ENOENT, and so ERR_PTR(-ENOENT) will be returned
to do_lookup, which does this:

need_revalidate:
	if (atomic)
		return -EWOULDBLOCKIO;
	dentry = do_revalidate(dentry, nd);
	if (!dentry)
		goto need_lookup;
	if (IS_ERR(dentry))
		goto fail;
	goto done;

fail:
	return PTR_ERR(dentry);

And so, instead of triggering a lookup with a new dentry as was intended, we
end up returning an error code all the way back to userspace! What we really
wanted to do, way back in try_to_fill_dentry, was to return 0. In that case,
do_revalidate would return NULL, which would cause do_lookup to call autofs4's
->lookup function, which would then trigger a new mount.

Ian, can you think of any reason we should be returning -ENOENT there? Will
changing that to a zero result in any side-effects? It doesn't look like it
from my reading of the code, but I'd appreciate your eyes on it.
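To make the proposal concrete, here is a sketch of the change being suggested
above (an illustration of the intent only, not the attached patch):

	/* try_to_fill_dentry(), sketch of the proposed change: after waiting
	 * out the pending expire, drop the dentry and return 0 rather than
	 * -ENOENT, so that do_revalidate() returns NULL and do_lookup()
	 * falls through to ->lookup(), triggering a fresh mount request. */
	if (ino && (ino->flags & AUTOFS_INF_EXPIRING)) {
		DPRINTK("waiting for expire %p name=%.*s",
			dentry, dentry->d_name.len, dentry->d_name.name);
		status = autofs4_wait(sbi, dentry, NFY_NONE);
		DPRINTK("expire done status=%d", status);
		d_invalidate(dentry);
		return 0;	/* "not valid": have the VFS redo the lookup */
	}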
Here's how I reproduced this problem:

auto.master:
/home2 /etc/auto.home2 --timeout=5 --debug
/misc  /etc/auto.misc  --timeout=5 --debug

auto.home2:
cw localhost:/export/cw

auto.misc:
data_kzm localhost:/export/data_kzm

# ls -l /export/cw
total 0
lrwxrwxrwx  1 root root 24 Jan 16 13:37 DATA3D -> /misc/data_kzm/cw/DATA3D
# ls -l /export/data_kzm
total 4
drwxr-xr-x  3 root root 4096 Jan 16 13:41 cw
# ls -l /export/data_kzm/cw
total 4
drwxr-xr-x  2 root root 4096 Jan 16 13:41 DATA3D

Start up the autofs service, and run this script:

---[cut here]---
#!/bin/bash

if [ $# -ne 1 ]; then
    echo "Usage: breakme.sh <outputfile>"
    exit
fi

date > $1

while true; do
    dd if=/dev/zero of=/home2/cw/DATA3D/outfile bs=1M count=1 >&/dev/null
    if [ $? -ne 0 ]; then
        break;
    fi;
    sleep 6;
done

date >> $1
mount >> $1
sleep 1
service autofs stop
---[cut here]---

The script will exit when it runs into the problem. On my test system, it
takes anywhere between 30 seconds and a couple of hours to reproduce. With
later kernels it tends to reproduce closer to the 30-second end of that range.
Created attachment 146977 [details]
Return a proper error code from try_to_fill_dentry after waiting for an expire event

This patch is what I am currently testing.
(In reply to comment #12)
> Ian, can you think of any reason we should be returning -ENOENT there? Will
> changing that to a zero result in any side-effects? It doesn't look like it
> from my reading of the code, but I'd appreciate your eyes on it.

No, with the latest v5 patch set that has become a bug. It is a case that I
missed when making the change to allow revalidate to return an error code
instead of true or false. Previously it didn't matter what was returned.

So that should work correctly then, does it?

Ian
(In reply to comment #15)
> (In reply to comment #12)
> > Ian, can you think of any reason we should be returning -ENOENT there? Will
> > changing that to a zero result in any side-effects? It doesn't look like it
> > from my reading of the code, but I'd appreciate your eyes on it.
>
> No, with the latest v5 patch set that has become a bug.
> It is a case that I missed when making the change to allow
> revalidate to return an error code instead of true or false.
> Previously it didn't matter what was returned.
>
> So that should work correctly then, does it?

It would be instructive to DPRINTK the d_count of the dentry. It should be 1
for the case we're trying to identify (a non-browse mount whose directory has
been removed); if it's not, then we have more work to do.

Ian
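A sketch of the suggested instrumentation (the placement inside
try_to_fill_dentry() and the exact message text are assumptions):

	/* Log the reference count while waiting out the expire; d_count == 1
	 * is the unreferenced, non-browse case we are trying to identify. */
	DPRINTK("expire wait: dentry %p d_count=%d name=%.*s",
		dentry, atomic_read(&dentry->d_count),
		dentry->d_name.len, dentry->d_name.name);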
Created attachment 147032 [details]
Fix the return codes passed back to the VFS after waiting for a pending expire

The last patch did not propagate the 0 return code all the way to the VFS. It
was instead converted to a 1, which would still elicit the problem being
observed. This patch fixes autofs4_revalidate to recognize a new, special
return code that requires the VFS to perform another lookup.

This patch is doing well in testing thus far.
(In reply to comment #18)
> Created an attachment (id=147032) [edit]
> Fix the return codes passed back to the VFS after waiting for a pending expire
>
> The last patch did not propagate the 0 return code all the way to the VFS. It
> was instead converted to a 1, which would still elicit the problem being
> observed. This patch fixes autofs4_revalidate to recognize a new, special
> return code that requires the VFS to perform another lookup.
>
> This patch is doing well in testing thus far.

Investigation of this bug led to the discovery of a race between mount and
expire, in addition to the return code issue above. A patch to resolve this is
attached.

Ian
Created attachment 152874 [details]
Patch to resolve race between mount and expire

What happens is that during an expire, the situation can arise where a
directory is removed and another lookup is done before the expire issues a
completion status to the kernel module. In this case, since the lookup gets a
new dentry, it doesn't know that there is an expire in progress; when it posts
its mount request, it matches the existing expire request and waits for its
completion. ENOENT is then returned to user space from lookup (as the dentry
passed in is now unhashed) without the mount request having been performed.

The solution is to keep track of dentrys in this unhashed state and reuse
them, if possible, in order to preserve the flags.

Ian
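Roughly, the idea as described above, sketched from the description only (not
the attached patch itself; autofs4_lookup_active() is a placeholder helper
name, not necessarily what the patch uses):

	/* In autofs4_lookup() (sketch): before continuing with the freshly
	 * allocated dentry, look for an existing unhashed dentry for the same
	 * name that still carries the expire state (AUTOFS_INF_EXPIRING) and
	 * reuse it, so the mount request sees the expire in progress instead
	 * of racing past it. */
	active = autofs4_lookup_active(dentry);	/* hypothetical helper */
	if (active)
		return active;	/* reuse the expiring dentry, preserving its flags */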
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
This request was evaluated by Red Hat Kernel Team for inclusion in a Red Hat Enterprise Linux maintenance release, and has moved to bugzilla status POST.
Committed in stream U6 build 55.3. A test kernel with this patch is available
from http://people.redhat.com/~jbaron/rhel4/
*** Bug 240095 has been marked as a duplicate of this bug. ***
(In reply to comment #29)
> Ian,
>
> I believe we can use the same packages as the previous hotfix request.
> Do you agree?

Yes, but I have a fix for a deadlock in the alarm handler which is not in CVS
yet. We've had only one report of this so far, but I expect that as people
start deployments we'll see more of it. The problem is that autofs can run for
several weeks before the problem shows up.

I don't have a qa_ack for the bug yet, and I think there may not be enough QA
resources to verify all the potential 5.1 bugs. I will be committing this fix
tomorrow and hope it gets into 5.1, and I'd recommend using that revision for
a hotfix, to save us having to possibly revisit what will appear to the
customer as the same issue. In fact, we should also advise any other customers
that have the kernel hotfix (there were two, I think).

Can you wait till Monday? Can you help make sure the customers know about the
additional hotfix (the other is for autofs itself rather than the kernel)?

Ian
Hi Ian,

No problem, we can wait until Monday.

-Flavio

Issue escalated to RHEL 4 Tools by: fleitner.
Internal Status set to 'Waiting on Engineering'

This event sent from IssueTracker by fleitner
issue 123073
Okay -- I'll re-close this ticket. Thanks again!

Internal Status set to 'Resolved'
Status set to: Closed by Client
Resolution set to: 'Closed by Client'

This event sent from IssueTracker by jmbastia
issue 120565
Hi Ian,

I need the packages to release a hotfix (scheduled for last Monday). Do you
already have them, or any idea when they will be ready?

thanks,
-Flavio

Internal Status set to 'Waiting on Engineering'

This event sent from IssueTracker by fleitner
issue 123073
(In reply to comment #33)
> Hi Ian,
>
> I need the packages to release a hotfix (scheduled for last Monday).
> Do you already have them, or any idea when they will be ready?

Yes, I forgot about this. Sorry, but I was wrong in my statement above: this
is RHEL 4 with autofs v4, and I was talking about the v5 user-space package
above. I believe the packages mentioned here should be OK for the hotfix.

Ian
Thank you Chris,

Customer is satisfied with the hotfix until U6 delivery.
I'm closing this ticket.

Regards,
Yves.

Internal Status set to 'Resolved'
Status set to: Closed by Client

This event sent from IssueTracker by yves.begrand
issue 123073
I have tested this on 2.6.9-55.0.2 and 2.6.9-55.0.4. The bug reproduces on
2.6.9-55.0.2 and does not occur on 2.6.9-55.0.4, so it appears to have been
fixed.
Created attachment 161842 [details]
Patch to fix wakeup order of processes when rehashing dentry

I've discovered a problem with the patch for this issue. It is related to the
wakeup order of waiting processes.

Ian
Created attachment 173321 [details]
Patch to sync autofs4 with upstream

There is a risk of some confusion regarding the various patches. In order to
be able to use the same patches everywhere, we need to sync the autofs4 source
in the various kernels with upstream. This patch brings the RHEL 4 kernel in
line with upstream.
Created attachment 173361 [details]
Patch to fix issue reported during Z-Stream update QA

This patch fixes a failure reported during QA testing for a Z-Stream update
requesting this patch. It is in fact a hunk from another autofs4 patch that
resolves a deadlock during directory creation under load (see bug #246530 for
info).

The deadlock patch delays hashing of dentrys at directory creation until the
actual create operation, so dentrys remain unhashed for a relatively long
time, and the code in this patch was needed there. With the expire/mount race
fix here, dentrys are unhashed for only a relatively brief time, so the need
for this code was not identified during development. However, if many
processes are concurrently accessing directories, it's possible there will be
two or more waiters in the queue. Only one of the waiters will have the dentry
required to complete the lookup; the others need to perform a d_lookup to get
the correct dentry. This patch allows those processes to perform the needed
d_lookup. A sketch of the mechanism follows.
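The mechanism described above, sketched (not the attached hunk itself; the
surrounding wait/wakeup code is assumed):

	/* A waiter that wakes up still holding an unhashed dentry was not the
	 * process that rehashed the name; re-look it up in the dcache so we
	 * continue with the dentry installed by the winning waiter. */
	if (d_unhashed(dentry)) {
		struct dentry *new;

		new = d_lookup(dentry->d_parent, &dentry->d_name);
		if (new) {
			dput(dentry);
			dentry = new;
		}
	}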
An advisory has been issued which should help the problem described in this
bug report. This report is therefore being closed with a resolution of ERRATA.
For more information on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report if the solution
does not work for you.

http://rhn.redhat.com/errata/RHBA-2007-0791.html