This bugzilla is coming from #455408, where we decided to split out some specific problems. For more details, traces etc. please look at #455408 and copy over where needed. This bugzilla is meant to handle the signalling issues identified: * The close() call is pretty clearly broken. We should never return -ERESTARTSYS to userspace. The close is almost certainly taking a long time because the client is flushing pages to the server before it can (or should) return. * It appears that the client is not handling NFS3ERR_JUKEBOX correctly and is treating it as a normal error in some path. That's a bug and we should address that. * the linux client is not handling signals in the way it should be during open/close. close() should not return -ERESTARTSYS. There's also the question of why we get a valid fd when the process is signaled while hung on an open() call (perhaps the RPC LOOKUP call was reissued and returned immediately?). This is probably related to the JUKEBOX errors that Peter mentions.
Made an attempt to reproduce this today, based on what we know about this NFS server. I modified a RHEL4 kernel so that I could make it always drop LOOKUP requests and to return NFS3ERR_JUKEBOX on any write WRITE call, whenever a particular nfsd_debug flag is set. The server is working as expected with this, but I can't seem to reproduce the problem -- if I ^c the dd or cp, the signal is ignored. I think I've been able to reproduce it, but I had to add the "intr" option to the mount (comment #1 in bug 455408 does not show that they were using "intr"): --------[snip]-------- 5679 16:04:59.967177 read(0, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) = 4096 <0.000019> 5679 16:04:59.967337 write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) = 4096 <0.000027> 5679 16:04:59.967517 close(0) = 0 <0.000024> 5679 16:04:59.967676 close(1) = ? ERESTARTSYS (To be restarted) <11.982816> 5679 16:05:11.950666 --- SIGINT (Interrupt) @ 0 (0) --- 5679 16:05:11.950798 close(1) = -1 EBADF (Bad file descriptor) <0.000019> --------[snip]-------- ...even worse, after I do this, I'm unable to unmount the share: # umount /mnt/rhel4 umount: /mnt/rhel4: device is busy umount: /mnt/rhel4: device is busy ...so this seems to cause some sort of refcounting bug as well.
...I can also reproduce this by just having iptables on the server drop traffic. I'll crank up nfs_debug next and see if I can tell what's causing it.
Actually, there is no refcounting bug. Once all of the pages for the file are flushed, I'm able to unmount. With so many calls in flight, it's hard to tell what's happening when the close returns -ERESTARTSYS, however. I may have to see whether I can reproduce this with a different call and then hope that the same fix works for the write/close here...
First, I should probably mention that I don't see *any* evidence of a problem unless "intr" is used as a mount option. That option should generally never be used if they care at all about data integrity. This problem should just go away if they eliminate that option. It's worthwhile to note that the intr/nointr options are now deprecated upstream. Since RHEL5 still has it though, I'll see if this can reasonably be fixed... I cranked up nfs_debug and rpc_debug, but the info in it wasn't particularly helpful. Here's a stack trace from the dd command when it's stuck in the blocked close() call: dd S ffff810001968728 0 2610 2609 (NOTLB) ffff81000e07bd28 0000000000000046 ffffffff805ab210 ffff81000e0e6040 0000000000000000 0000000000000004 ffff81000e0e6040 ffff810001be2140 000003286c830f7f 000000000723e6ab ffff81000e0e6228 0000000100000046 Call Trace: [<ffffffff800a57c6>] lock_release_holdtime+0x27/0x48 [<ffffffff800688ed>] _spin_unlock_irqrestore+0x3e/0x44 [<ffffffff884711e9>] :nfs:nfs_wait_bit_interruptible+0x0/0x28 [<ffffffff8847120b>] :nfs:nfs_wait_bit_interruptible+0x22/0x28 [<ffffffff80066ac5>] __wait_on_bit+0x40/0x6e [<ffffffff884711e9>] :nfs:nfs_wait_bit_interruptible+0x0/0x28 [<ffffffff80066b5f>] out_of_line_wait_on_bit+0x6c/0x78 [<ffffffff800a2c4c>] wake_bit_function+0x0/0x23 [<ffffffff884711cf>] :nfs:nfs_wait_on_request+0x56/0x70 [<ffffffff88474c81>] :nfs:nfs_wait_on_requests_locked+0x6b/0xc5 [<ffffffff88475c92>] :nfs:nfs_sync_inode_wait+0x59/0x1ce [<ffffffff8846bef5>] :nfs:nfs_do_fsync+0x1f/0x3f [<ffffffff8846c660>] :nfs:nfs_file_flush+0x53/0x76 [<ffffffff80024f96>] filp_close+0x36/0x64 [<ffffffff8001ebc9>] sys_close+0x8f/0xaa [<ffffffff800602a6>] tracesys+0xd5/0xdf I'll see if I can determine from where the -ERESTARTSYS is coming based on that...
Given that we've been telling people, for a long time, to use "intr", perhaps it would be good to explain what "intr" does and then why it is bad to use. We've been telling people, for a long time, to use "intr" because will keep an application from being stuck forever when waiting for a response from a server which is dead. It would allow the application to receive a signal and to be able to terminate. I think that we need to explain what is different now.
Ok, I see what's happening on the close call... When we signal the process, nfs_wait_bit_interruptible() returns -ERESTARTSYS which bubbles back up to userspace in the close() call. But even when it returns error, the file descriptor is still torn down. Userspace apparently receives the -ERESTARTSYS and retries the close() call, but by then the fd is no good anymore so that returns -EBADF. The simplest fix is simply to make nfs_file_flush return -EINTR instead of -ERESTARTSYS, but maybe that's better handled in filp_close() itself? I'm not sure if -EINTR is a legit return code for fsync().
As far as what's changed upstream... It looks like the TASK_KILLABLE infrastructure that was added there allowed us to get rid of intr upstream. I'm not too familiar with what that change does, so I'll need to read up on it. I may be overstating the danger of using "intr" though I recall Chuck Lever being insistent that it was dangerous. You do have to deal with the fact that if your programs are signaled and then return then you don't necessarily know what the outcome of your system call was, but that's probably not that big a problem with some apps...
Created attachment 312564 [details] patch -- fix wrong error code on interrupted close() syscalls This patch (by Ernie) looks like what we need to fix the close() calls. It just changes the -ERESTARTSYS (and a few other return codes that close() shouldn't return) to -EINTR. I'll build a kernel with it and give it a test in a bit.
LWN has an article on TASK_KILLABLE: http://lwn.net/Articles/288056/ ...looks like it's basically an "intermediate" state between TASK_INTERRUPTIBLE and TASK_UNINTERRUPTIBLE that only allows fatal signals to interrupt it. It's a good idea, but not something we can reasonably add in RHEL5. The patch in comment #8 will probably fix the problems with close() calls that they've seen, but it probably won't do anything for the open() codepath. I've had a lot harder time reproducing the problems they saw with open calls.
I think I understand what happened with the open() call as well. Here's a strace that I just generated which replicates what they were seeing: 6609 13:43:09.564628 lseek(0, 0, SEEK_CUR) = 0 <0.000024> 6609 13:43:09.564807 close(1) = 0 <0.000025> 6609 13:43:09.564973 open("/mnt/rhel4/testfile", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 1 <51.522328> 6609 13:44:01.087699 --- SIGINT (Interrupt) @ 0 (0) --- 6609 13:44:01.087883 rt_sigaction(SIGUSR1, NULL, {SIG_DFL}, 8) = 0 <0.000026> 6609 13:44:01.088080 rt_sigaction(SIGINT, NULL, {SIG_IGN}, 8) = 0 <0.000025> ...to get this, I had to mount the server with "nointr", and then had to have the server go unresponsive before trying to open the file (I just set some iptables rules on the server to drop traffic). The open call will generate some LOOKUP RPC's, which are not getting responses. If I then hit ^c on the file, the process will be sent a SIGINT, but because this is a nointr mount, the signal is masked off. A little while after I send the signal, I unblock the traffic to the server. The RPC call then completes and the open() returns a valid fd. The signal is then delivered to the program since the mask is changed. My hypothesis is that the strace here: https://bugzilla.redhat.com/show_bug.cgi?id=455408#c5 ...was generated while working with a mount that did not use the "intr" option. The server was unresponsive to LOOKUP calls and the process was signaled while in an open() syscall. Eventually, the server did respond, the open() call returned a valid fd and then the signal was delivered. If this is the case, then I don't think we have a bug in the open() call. The concern was based on a misinterpretation of the info in that strace. The -ERESTARTSYS from close() is a real problem however, so I'll concentrate on fixing that at this point.
As expected, the patch in comment #8 fixes this issue. I'll plan to submit this internally soon.
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
in kernel-2.6.18-101.el5 You can download this test kernel from http://people.redhat.com/dzickus/el5
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on therefore solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2009-0225.html