Bug 455729 - close system call returns -ERESTARTSYS
Summary: close system call returns -ERESTARTSYS
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.2
Hardware: All
OS: Linux
low
low
Target Milestone: rc
: ---
Assignee: Jeff Layton
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2008-07-17 13:18 UTC by Jan Wildeboer
Modified: 2011-12-26 02:38 UTC (History)
3 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2009-01-20 20:24:05 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
patch -- fix wrong error code on interrupted close() syscalls (2.60 KB, patch)
2008-07-24 14:33 UTC, Jeff Layton
no flags Details | Diff


Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 455408 0 medium CLOSED NFS perfomance problems on s390x 2021-02-22 00:41:40 UTC
Red Hat Product Errata RHSA-2009:0225 0 normal SHIPPED_LIVE Important: Red Hat Enterprise Linux 5.3 kernel security and bug fix update 2009-01-20 16:06:24 UTC

Description Jan Wildeboer 2008-07-17 13:18:21 UTC
This bugzilla is coming from #455408, where we decided to split out some
specific problems.

For more details, traces etc. please look at #455408 and copy over where needed.

This  bugzilla is meant to handle the signalling issues identified:

* The close() call is pretty clearly broken. We should never return -ERESTARTSYS
to userspace. The close is almost certainly taking a long time because the
client is flushing pages to the server before it can (or should) return.

* It appears that the client is not handling NFS3ERR_JUKEBOX correctly
and is treating it as a normal error in some path.  That's a bug and
we should address that.

* the linux client is not handling signals in the way it should be during
open/close. close() should not return -ERESTARTSYS. There's also the question of
why we get a valid fd when the process is signaled while hung on an open() call
(perhaps the RPC LOOKUP call was reissued and returned immediately?). This is
probably related to the JUKEBOX errors that Peter mentions.

Comment 1 Jeff Layton 2008-07-23 20:08:20 UTC
Made an attempt to reproduce this today, based on what we know about this NFS
server. I modified a RHEL4 kernel so that I could make it always drop LOOKUP
requests and to return NFS3ERR_JUKEBOX on any write WRITE call, whenever a
particular nfsd_debug flag is set. The server is working as expected with this,
but I can't seem to reproduce the problem -- if I ^c the dd or cp, the signal is
ignored.

I think I've been able to reproduce it, but I had to add the "intr" option to
the mount (comment #1 in bug 455408 does not show that they were using "intr"):

--------[snip]--------
5679  16:04:59.967177 read(0,
"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) =
4096 <0.000019>
5679  16:04:59.967337 write(1,
"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) =
4096 <0.000027>
5679  16:04:59.967517 close(0)          = 0 <0.000024>
5679  16:04:59.967676 close(1)          = ? ERESTARTSYS (To be restarted)
<11.982816>
5679  16:05:11.950666 --- SIGINT (Interrupt) @ 0 (0) ---
5679  16:05:11.950798 close(1)          = -1 EBADF (Bad file descriptor) <0.000019>
--------[snip]--------

...even worse, after I do this, I'm unable to unmount the share:

# umount /mnt/rhel4
umount: /mnt/rhel4: device is busy
umount: /mnt/rhel4: device is busy

...so this seems to cause some sort of refcounting bug as well.


Comment 2 Jeff Layton 2008-07-23 20:17:32 UTC
...I can also reproduce this by just having iptables on the server drop traffic.
I'll crank up nfs_debug next and see if I can tell what's causing it.


Comment 3 Jeff Layton 2008-07-23 20:31:15 UTC
Actually, there is no refcounting bug. Once all of the pages for the file are
flushed, I'm able to unmount.

With so many calls in flight, it's hard to tell what's happening when the close
returns -ERESTARTSYS, however. I may have to see whether I can reproduce this
with a different call and then hope that the same fix works for the write/close
here...


Comment 4 Jeff Layton 2008-07-24 13:58:33 UTC
First, I should probably mention that I don't see *any* evidence of a problem
unless "intr" is used as a mount option. That option should generally never be
used if they care at all about data integrity. This problem should just go away
if they eliminate that option. It's worthwhile to note that the intr/nointr
options are now deprecated upstream.

Since RHEL5 still has it though, I'll see if this can reasonably be fixed...

I cranked up nfs_debug and rpc_debug, but the info in it wasn't particularly
helpful. Here's a stack trace from the dd command when it's stuck in the blocked
close() call:

dd            S ffff810001968728     0  2610   2609                     (NOTLB)
 ffff81000e07bd28 0000000000000046 ffffffff805ab210 ffff81000e0e6040
 0000000000000000 0000000000000004 ffff81000e0e6040 ffff810001be2140
 000003286c830f7f 000000000723e6ab ffff81000e0e6228 0000000100000046
Call Trace:
 [<ffffffff800a57c6>] lock_release_holdtime+0x27/0x48
 [<ffffffff800688ed>] _spin_unlock_irqrestore+0x3e/0x44
 [<ffffffff884711e9>] :nfs:nfs_wait_bit_interruptible+0x0/0x28
 [<ffffffff8847120b>] :nfs:nfs_wait_bit_interruptible+0x22/0x28
 [<ffffffff80066ac5>] __wait_on_bit+0x40/0x6e
 [<ffffffff884711e9>] :nfs:nfs_wait_bit_interruptible+0x0/0x28
 [<ffffffff80066b5f>] out_of_line_wait_on_bit+0x6c/0x78
 [<ffffffff800a2c4c>] wake_bit_function+0x0/0x23
 [<ffffffff884711cf>] :nfs:nfs_wait_on_request+0x56/0x70
 [<ffffffff88474c81>] :nfs:nfs_wait_on_requests_locked+0x6b/0xc5
 [<ffffffff88475c92>] :nfs:nfs_sync_inode_wait+0x59/0x1ce
 [<ffffffff8846bef5>] :nfs:nfs_do_fsync+0x1f/0x3f
 [<ffffffff8846c660>] :nfs:nfs_file_flush+0x53/0x76
 [<ffffffff80024f96>] filp_close+0x36/0x64
 [<ffffffff8001ebc9>] sys_close+0x8f/0xaa
 [<ffffffff800602a6>] tracesys+0xd5/0xdf

I'll see if I can determine from where the -ERESTARTSYS is coming based on that...




Comment 5 Peter Staubach 2008-07-24 14:15:24 UTC
Given that we've been telling people, for a long time, to use "intr",
perhaps it would be good to explain what "intr" does and then why it
is bad to use.

We've been telling people, for a long time, to use "intr" because
will keep an application from being stuck forever when waiting for
a response from a server which is dead.  It would allow the application
to receive a signal and to be able to terminate.

I think that we need to explain what is different now.

Comment 6 Jeff Layton 2008-07-24 14:17:47 UTC
Ok, I see what's happening on the close call...

When we signal the process, nfs_wait_bit_interruptible() returns -ERESTARTSYS
which bubbles back up to userspace in the close() call. But even when it returns
error, the file descriptor is still torn down. Userspace apparently receives the
-ERESTARTSYS and retries the close() call, but by then the fd is no good anymore
so that returns -EBADF.

The simplest fix is simply to make nfs_file_flush return -EINTR instead of
-ERESTARTSYS, but maybe that's better handled in filp_close() itself? I'm not
sure if -EINTR is a legit return code for fsync().


Comment 7 Jeff Layton 2008-07-24 14:21:52 UTC
As far as what's changed upstream...

It looks like the TASK_KILLABLE infrastructure that was added there allowed us
to get rid of intr upstream. I'm not too familiar with what that change does, so
I'll need to read up on it.

I may be overstating the danger of using "intr" though I recall Chuck Lever
being insistent that it was dangerous. You do have to deal with the fact that if
your programs are signaled and then return then you don't necessarily know what
the outcome of your system call was, but that's probably not that big a problem
with some apps...




Comment 8 Jeff Layton 2008-07-24 14:33:03 UTC
Created attachment 312564 [details]
patch -- fix wrong error code on interrupted close() syscalls

This patch (by Ernie) looks like what we need to fix the close() calls. It just
changes the -ERESTARTSYS (and a few other return codes that close() shouldn't
return) to -EINTR.

I'll build a kernel with it and give it a test in a bit.

Comment 9 Jeff Layton 2008-07-24 14:42:11 UTC
LWN has an article on TASK_KILLABLE:

http://lwn.net/Articles/288056/

...looks like it's basically an "intermediate" state between TASK_INTERRUPTIBLE
and TASK_UNINTERRUPTIBLE that only allows fatal signals to interrupt it. It's a
good idea, but not something we can reasonably add in RHEL5.

The patch in comment #8 will probably fix the problems with close() calls that
they've seen, but it probably won't do anything for the open() codepath. I've
had a lot harder time reproducing the problems they saw with open calls.


Comment 10 Jeff Layton 2008-07-24 17:53:51 UTC
I think I understand what happened with the open() call as well. Here's a strace
that I just generated which replicates what they were seeing:

6609  13:43:09.564628 lseek(0, 0, SEEK_CUR) = 0 <0.000024>
6609  13:43:09.564807 close(1)          = 0 <0.000025>
6609  13:43:09.564973 open("/mnt/rhel4/testfile", O_WRONLY|O_CREAT|O_TRUNC,
0666) = 1 <51.522328>
6609  13:44:01.087699 --- SIGINT (Interrupt) @ 0 (0) ---
6609  13:44:01.087883 rt_sigaction(SIGUSR1, NULL, {SIG_DFL}, 8) = 0 <0.000026>
6609  13:44:01.088080 rt_sigaction(SIGINT, NULL, {SIG_IGN}, 8) = 0 <0.000025>

...to get this, I had to mount the server with "nointr", and then had to have
the server go unresponsive before trying to open the file (I just set some
iptables rules on the server to drop traffic). The open call will generate some
LOOKUP RPC's, which are not getting responses.

If I then hit ^c on the file, the process will be sent a SIGINT, but because
this is a nointr mount, the signal is masked off. A little while after I send
the signal, I unblock the traffic to the server. The RPC call then completes and
the open() returns a valid fd. The signal is then delivered to the program since
the mask is changed.

My hypothesis is that the strace here:

    https://bugzilla.redhat.com/show_bug.cgi?id=455408#c5

...was generated while working with a mount that did not use the "intr" option.
The server was unresponsive to LOOKUP calls and the process was signaled while
in an open() syscall. Eventually, the server did respond, the open() call
returned a valid fd and then the signal was delivered.

If this is the case, then I don't think we have a bug in the open() call. The
concern was based on a misinterpretation of the info in that strace.

The -ERESTARTSYS from close() is a real problem however, so I'll concentrate on
fixing that at this point.


Comment 11 Jeff Layton 2008-07-24 19:35:24 UTC
As expected, the patch in comment #8 fixes this issue. I'll plan to submit this
internally soon.


Comment 13 RHEL Program Management 2008-07-24 19:53:57 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 15 Don Zickus 2008-07-31 00:49:34 UTC
in kernel-2.6.18-101.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Comment 19 errata-xmlrpc 2009-01-20 20:24:05 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on therefore solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-0225.html


Note You need to log in before you can comment on or make changes to this bug.