Bug 726452

Summary:	open() calls for files mounted via kerberized nfsv4 for a user with expired ticket hangs
Product:	Red Hat Enterprise Linux 6	Reporter:	prozaconstilts
Component:	nfs-utils	Assignee:	Steve Dickson <steved>
Status:	CLOSED NOTABUG	QA Contact:	yanfu,wang <yanwang>
Severity:	high	Docs Contact:
Priority:	unspecified
Version:	6.1
Target Milestone:	rc
Target Release:	---
Hardware:	x86_64
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2011-09-27 18:31:16 UTC	Type:	---
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description prozaconstilts 2011-07-28 16:32:13 UTC

Description of problem:

Opens of files located under a kerberized NFSv4 mount hang when the user owning the file has an expired (but still existing) Kerberos credential cache.

Version-Release number of selected component (if applicable):

nfs-utils-1.2.3.7.el6.x86_64
krb5-libs-1.9-9.el6_1.1.x86_64

How reproducible:

Easily reproduced by others in my environment. I'm not sure what exactly the underlying cause is, so it may be difficult to reproduce elsewhere.

Steps to Reproduce:

-build a RHEL6 NFS client that mounts an NFS server via kerberized NFSv4.
-request a ticket with a short lifetime and renewal time
-nfs mount a directory you have access to, and cat any file you own
-wait until your ticket expires
-try to cat a file you own, or ssh into the server, or any other operation that will try to open a file you own

Actual results:

hangs indefinitely


Expected results:

returns permission denied
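
To tell the actual result (hang) apart from the expected one (permission denied) without freezing the calling shell, the open() can be attempted in a child process with a deadline. The following is only a sketch: the helper name probe_open, the 10-second default, and the exit-code mapping are hypothetical choices for illustration, and it assumes the blocked child can still be killed once the deadline passes.

```python
import subprocess
import sys

# Child snippet: try to open the path and map the outcome to an exit code.
_CHILD = (
    "import sys\n"
    "try:\n"
    "    open(sys.argv[1], 'rb').close()\n"
    "except PermissionError:\n"
    "    sys.exit(13)\n"
    "except OSError:\n"
    "    sys.exit(1)\n"
)


def probe_open(path, timeout=10.0):
    """Return 'ok', 'denied', 'error', or 'hung' for an open() of path."""
    try:
        # subprocess.run kills the child if the deadline passes.
        result = subprocess.run(
            [sys.executable, "-c", _CHILD, path], timeout=timeout
        )
    except subprocess.TimeoutExpired:
        return "hung"
    return {0: "ok", 13: "denied"}.get(result.returncode, "error")
```

With an expired ticket, the behavior reported here would show up as 'hung', whereas the expected behavior would show up as 'denied'.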


Additional info:

The rpcgssd downcall writes a different value depending on whether the ccache is expired or nonexistent:


with an expired cache:
   write(12, "?\t\0\0\0\0\0\0\0\0\0\0\201\377\377\377", 16) = 16
   | 00000  3f 09 00 00 00 00 00 00  00 00 00 00 81 ff ff ff  ?....... ........ |

without a cache:
   write(12, "?\t\0\0\0\0\0\0\0\0\0\0\363\377\377\377", 16) = 16
   | 00000  3f 09 00 00 00 00 00 00  00 00 00 00 f3 ff ff ff  ?....... ........ |

I'm not surprised that it writes something different for an expired ccache vs. a nonexistent one, but I'm unable to determine what receives the result of the downcall, and why it decides to hang...
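
The two 16-byte writes above can be decoded with a few lines of Python. This is a sketch that assumes the error downcall consists of four little-endian 32-bit fields (uid, timeout, sequence window, error code); that layout is an assumption on my part, not something stated in this report, and the function name decode_downcall is hypothetical.

```python
import errno
import struct


def decode_downcall(buf):
    """Split a 16-byte downcall into (uid, timeout, seq_win, err, name)."""
    # Assumed layout: uint32 uid, uint32 timeout, uint32 seq_win, int32 err,
    # all little-endian as on x86_64.
    uid, timeout, seq_win, err = struct.unpack("<IIIi", buf)
    # Map the negated error to a symbolic name where the platform knows it.
    name = errno.errorcode.get(-err, "unknown")
    return uid, timeout, seq_win, err, name


# Bytes copied from the two strace hex dumps above.
expired = bytes.fromhex("3f090000 00000000 00000000 81ffffff")
missing = bytes.fromhex("3f090000 00000000 00000000 f3ffffff")

print(decode_downcall(expired))  # error field is -127
print(decode_downcall(missing))  # error field is -13
```

Under that assumed layout, both writes carry uid 2367 and differ only in the last field: -127 for the expired cache and -13 for the missing one. On Linux, errno 127 is EKEYEXPIRED and errno 13 is EACCES.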

My Kerberos server is a 2008 R2 AD. My RHEL5 clients do not exhibit this bug against the same Kerberos server and NFS server.

I can provide any conf files needed upon request.

Thanks!

Comment 2 prozaconstilts 2011-07-29 11:50:25 UTC

Actually, after conferring with my colleagues, I believe one of them may have done a better job identifying this problem. Here is a paste of his e-mail:

It looks like that guess may have been accurate. Here is the beginning
of the patchset designed specifically to make the kernel spin (with
exponential backoff) when access is requested after a TGT has expired.
The use case driving this was specifically long term jobs.
http://linux-nfs.org/pipermail/nfsv4/2010-January/012012.html

When someone deploys kerberized NFS, they usually will quickly run
across a major problem. As soon as their credentials expire, all
RPCs start failing with -EACCES errors. This makes it really
difficult to have any sort of long-running job since you have to
proactively kinit before your TGT expires. If you miss doing so,
then your job may start getting errors unexpectedly.

This patchset represents a first pass at fixing this. The idea here
is to distinguish between the situation where someone has an expired
credential cache and someone that has no credential cache at all. In
the latter case, we want to have the RPC return -EACCES (just like
it does today), in the former case we want to return a different
error that will make the NFS layer delay and retry the call instead
of erroring out (-EKEYEXPIRED).

This patchset is for the kernel patches. To make this work, gssd
will also need to be fixed to send different errors in these
situations. That patch will follow this set.

and here is the patch which actually causes the kernel to wait
http://linux-nfs.org/pipermail/nfsv4/2010-January/012014.html

If a KRB5 TGT ticket expires, we don't want to return an error
immediately. If someone has a long running job and just forgets to
run "kinit" in time then this will make it fail.

Instead, we want to treat this situation as we would NFS4ERR_DELAY
and retry the upcall after delaying a bit with an exponential
backoff.

This patch just makes any place that would handle NFS4ERR_DELAY also
handle -EKEYEXPIRED the same way. In the future, we may want to be
more sophisticated however and handle hard vs. soft mounts
differently, or specify some upper limit on how long we'll wait for
a new TGT to be acquired.

There are some timeout checks in place in the RHEL6 kernel, but
they all seem to eventually loop into infinity, surprisingly even
if running 'soft'.

nfs4_handle_exception:
        case -EKEYEXPIRED:
                ret = nfs4_delay(server->client, &exception->timeout);
                if (ret != 0)
                        break;

nfs4_async_handle_error:
        case -EKEYEXPIRED:
                rpc_delay(task, NFS4_POLL_RETRY_MAX);
                task->tk_status = 0;
                return -EAGAIN;
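
The retry behavior described by the patch and the fragments above can be modeled in user space. The sketch below is not the kernel code: the function name call_with_backoff, the 0.1 s and 15 s delay bounds, and the optional max_tries limit are all assumptions chosen for illustration. It only shows the general idea that an expired-key error is retried with a doubling, capped delay while any other error is returned to the caller right away.

```python
import errno
import time

# EKEYEXPIRED is 127 on Linux; fall back to that value on other platforms.
EKEYEXPIRED = getattr(errno, "EKEYEXPIRED", 127)


def call_with_backoff(op, sleep=time.sleep, min_delay=0.1, max_delay=15.0,
                      max_tries=None):
    """Retry op() while it raises EKEYEXPIRED, doubling the delay each time."""
    delay = min_delay
    tries = 0
    while True:
        try:
            return op()
        except OSError as exc:
            if exc.errno != EKEYEXPIRED:
                raise  # e.g. EACCES: fail immediately
            tries += 1
            if max_tries is not None and tries >= max_tries:
                raise  # optional upper limit on retries
            sleep(delay)
            delay = min(delay * 2, max_delay)  # exponential backoff, capped
```

With max_tries left at None, the loop never gives up while the error persists, which matches the indefinite hang seen from the application's side. Passing a limit models the "upper limit on how long we'll wait" that the patch description mentions as possible future work.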

At the moment, this basically seems to boil down to:

In RHEL5, a long-running process that kept operating after the TGT
expired would suddenly get 'access denied' when reading or writing
data that it may have been using moments before. Unless the application
was reasonably well written, that generally meant it just crashed.
Depending on the situation, that could easily have resulted in corrupted
state.

In RHEL6, that same process is basically blocked on I/O after the TGT
expires: it sits there and spins, waiting for the filesystem to be
available again. From a single dumb application's point of view, this is
a pretty good approach. It effectively gets stuck in I/O wait, and when
the TGT is finally renewed, it can continue processing as if nothing
happened. Obviously this can cause some confusion, and it is not ideal for
really smart applications, which might be able to recover in some other
manner (use other space, continue processing and write later, etc.), or
for single-threaded interactive applications, which will seemingly just
freeze.

Given the two paths, I certainly see the draw of the RHEL6 approach,
particularly for someone who was not thinking of NFS-mounted home
directories. Obviously with home directories involved (especially those
backing a graphical session) it tends to lock things up, but it probably
results in less actual corruption than RHEL5. Additionally, given a renewed
TGT in RHEL6, all state would conceivably be recoverable as long as the
process has not been sitting for long periods. That is probably less true
when an NFS-based home directory with a graphical session is in play.

A few solutions are coming to mind, some far better than others.
Unfortunately the best choices are going to involve a fair amount of
work.

Comment 3 Steve Dickson 2011-09-27 18:31:16 UTC

I'm thinking we are probably not going to go much further, in RHEL6,
than Jeff's upstream patches that deal with this problem. As stated,
RHEL 6 deals with expired credentials better, but not perfectly for all
applications. So barring some unexpected breakthrough upstream,
I am going to close this as NOTABUG.