Bug 1056658 - Hangs when logging in on NFS4 /home
Summary: Hangs when logging in on NFS4 /home
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: nfs-utils
Version: 7.0
Hardware: All
OS: Linux
Priority: high
Severity: high
Target Milestone: rc
Target Release: ---
Assignee: Steve Dickson
QA Contact: JianHong Yin
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2014-01-22 16:29 UTC by Jeff Layton
Modified: 2016-03-09 02:36 UTC (History)
CC List: 11 users

Fixed In Version: nfs-utils-1.2.9-3.1.el7
Doc Type: Bug Fix
Doc Text:
Clone Of: 1052902
Clones: 1223661
Environment:
Last Closed: 2014-06-13 09:59:48 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)

Description Jeff Layton 2014-01-22 16:29:08 UTC
+++ This bug was initially created as a clone of Bug #1052902 +++

Description of problem:

When trying to log in as a user having his $HOME on NFS4, the related processes (kdm, login) go into D state and do not recover automatically.

I can recover by

  * su user -s /bin/kinit
  * kill -9 <D-processes>
  * service rpcgssd restart

but this does not work 100% of the time.

Things seem to work after this, but when the ticket expires, processes (especially firefox (--> file locks?)) get stuck in D state and a reboot is required.


Version-Release number of selected component (if applicable):

nfs-utils-1.2.9-2.1.fc20.x86_64  (from koji)
kernel-3.12.6-300.fc20.x86_64


How reproducible:

100%

--- Additional comment from Enrico Scholz on 2014-01-14 06:16:27 EST ---



--- Additional comment from nmorey on 2014-01-21 08:43:17 EST ---

I am not sure if this is related, but I updated recently (and got nfs-utils 1.2.9-2.1) and couldn't log in as any Kerberos user. I managed to log in as root and run
strace -f ksu <mylogin>

It showed that it got stuck while loading my bashrc.
All the while:
ksu <mylogin> -e /bin/ls / 
worked perfectly.

After a long investigation, I simply downgraded nfs-utils back to 1.2.8.6-0 and everything is working again!

Nothing showed up in the logs on either the client or the server.

--- Additional comment from J. Bruce Fields on 2014-01-21 09:49:35 EST ---

OK, so this is a regression that was introduced somewhere between 1.2.8.6-0 and 1.2.9-2.1.

Wait, actually here's something interesting:

rpc.gssd        D ffff88033e394540     0  2657   2652 0x00000000
 ffff8803134bf858 0000000000000086 ffff8803134bffd8 0000000000014540
 ffff8803134bffd8 0000000000014540 ffff88032aa9a940 ffff88032a63fec8
 ffff880313c04520 ffff8803204c8d80 ffff88032b82c300 ffff88032aa9a940
Call Trace:
 [<ffffffff81667d89>] schedule+0x29/0x70
 [<ffffffffa01a1357>] gss_cred_init+0x277/0x3a0 [auth_rpcgss]
 [<ffffffff8108be90>] ? wake_up_atomic_t+0x30/0x30
 [<ffffffffa01694f9>] rpcauth_lookup_credcache+0x179/0x240 [sunrpc]
 [<ffffffffa015cb30>] ? call_retry_reserve+0x60/0x60 [sunrpc]
 [<ffffffffa019f52e>] gss_lookup_cred+0xe/0x10 [auth_rpcgss]
 [<ffffffffa016a3dc>] generic_bind_cred+0x1c/0x20 [sunrpc]
 [<ffffffffa0169c79>] rpcauth_refreshcred+0x99/0x1c0 [sunrpc]
 [<ffffffffa0160a4a>] ? xprt_lock_and_alloc_slot+0x6a/0x80 [sunrpc]
 [<ffffffffa015c7d0>] ? call_bc_transmit+0x160/0x160 [sunrpc]
 [<ffffffffa015c7d0>] ? call_bc_transmit+0x160/0x160 [sunrpc]
 [<ffffffffa015cb30>] ? call_retry_reserve+0x60/0x60 [sunrpc]
 [<ffffffffa015cb30>] ? call_retry_reserve+0x60/0x60 [sunrpc]
 [<ffffffffa015cb6c>] call_refresh+0x3c/0x70 [sunrpc]
 [<ffffffffa0166fc4>] __rpc_execute+0x84/0x400 [sunrpc]
 [<ffffffffa016861e>] rpc_execute+0x5e/0xa0 [sunrpc]
 [<ffffffffa015eeb0>] rpc_run_task+0x70/0x90 [sunrpc]
 [<ffffffffa05fa5f6>] nfs4_call_sync_sequence+0x56/0x80 [nfsv4]
 [<ffffffffa05fc3c7>] _nfs4_proc_access+0x107/0x190 [nfsv4]
 [<ffffffffa060147b>] nfs4_proc_access+0x4b/0xb0 [nfsv4]
 [<ffffffffa05c54b6>] nfs_do_access+0x226/0x2b0 [nfs]
 [<ffffffffa016990f>] ? rpcauth_lookupcred+0x7f/0xd0 [sunrpc]
 [<ffffffffa05c5643>] nfs_permission+0xd3/0x190 [nfs]
 [<ffffffff811b9c34>] __inode_permission+0x64/0xb0
 [<ffffffff811b9c98>] inode_permission+0x18/0x50
 [<ffffffff811ba32a>] link_path_walk+0x24a/0x850
 [<ffffffff812a1c06>] ? security_file_alloc+0x16/0x20
 [<ffffffff811be12c>] path_openat+0x9c/0x660
 [<ffffffff8154c6ef>] ? sock_destroy_inode+0x2f/0x40
 [<ffffffff811beeaa>] do_filp_open+0x3a/0x90
 [<ffffffff811caa3d>] ? __alloc_fd+0x7d/0x120
 [<ffffffff811ad9ce>] do_sys_open+0x12e/0x210
 [<ffffffff811adace>] SyS_open+0x1e/0x20

So rpc.gssd is trying to access an nfs filesystem.  That's an obvious deadlock.  Where exactly are nfs filesystems mounted?  I wonder if there's some reason recent gssd is trying to look at files under /home....
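The reasoning above (a gssd task blocked inside NFS client code) can be checked mechanically. The following is a minimal sketch, assuming frames in the format shown in this trace and assuming that the `nfs` and `nfsv4` module tags identify NFS client code; the function name `nfs_frames` is hypothetical.

```python
import re

# Matches frames such as " [<ffffffffa05c5643>] nfs_permission+0xd3/0x190 [nfs]".
# An optional "? " marks frames the kernel's unwinder is unsure about.
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\]\s+(\? )?(?P<func>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?:\s+\[(?P<module>\w+)\])?"
)

# Assumption: module names exactly as they appear in the trace above.
NFS_MODULES = {"nfs", "nfsv4"}


def nfs_frames(trace):
    """Return the reliable (non-'?') frames that belong to an NFS client module."""
    found = []
    for line in trace.splitlines():
        m = FRAME_RE.search(line)
        if m and not m.group(1) and m.group("module") in NFS_MODULES:
            found.append(m.group("func"))
    return found


# A few frames copied from the rpc.gssd trace above.
SAMPLE = """\
 [<ffffffffa01a1357>] gss_cred_init+0x277/0x3a0 [auth_rpcgss]
 [<ffffffffa016990f>] ? rpcauth_lookupcred+0x7f/0xd0 [sunrpc]
 [<ffffffffa05c5643>] nfs_permission+0xd3/0x190 [nfs]
 [<ffffffff811adace>] SyS_open+0x1e/0x20
"""

print(nfs_frames(SAMPLE))  # ['nfs_permission']
```

A non-empty result for a task named rpc.gssd would suggest the same situation described above: the daemon that is supposed to supply credentials for NFS is itself waiting on NFS.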

--- Additional comment from nmorey on 2014-01-21 09:56:09 EST ---

My home is in fact under nfs4+krb. I can run some diagnostics if specific traces are required.

On a slightly different note, what is Red Hat's process to validate this?
I've been running on a nfs+krb home since FC14 and I've seen a lot of bugs in the last 4 years, from kernel crashes to yum not running in the home dir.
Isn't there a way to "intensify" the testing on nfs+krb? I try to be an early tester, but as it's my workstation, I can't really use the rawhide version.

--- Additional comment from Enrico Scholz on 2014-01-21 10:03:08 EST ---

NFS4 is mounted at /home (without 'comment=systemd.automount' fstab option, but this does not make an (obvious) difference).

fwiw, I switched to testing nfs-utils because NFS4 hangs with 1.2.8.6 after a while (here, firefox seems to be involved too).

--- Additional comment from J. Bruce Fields on 2014-01-21 10:40:11 EST ---

(In reply to nmorey from comment #4)
> My home is in fact under nfs4+krb. I can run some diagnostics if specific
> traces are required.

I wonder if you could get an strace of gssd?  So, you'd do something like:

 - upgrade gssd, or whatever you need to do to ensure the problem will reproduce reliably.
 - probably reboot
 - strace -oOUTPUT -p$(pidof rpc.gssd)
 - try to log in as a user, wait for the deadlock.

Then attach OUTPUT.  Mainly what I'm curious about is what file gssd is trying to open.

> On a slightly different note, what is Red Hat's process to validate this?
> I've been running on a nfs+krb home since FC14 and I've seen a lot of bugs
> in the last 4 years, from kernel crashes to yum not running in the home dir.
> Isn't there a way to "intensify" the testing on nfs+krb? I try to be an
> early tester, but as it's my workstation, I can't really use the rawhide
> version.

Good question, but I'm not sure who to ask.  Googling around.... I guess I'd start by poking around under https://fedoraproject.org/wiki/QA and try to find out how to add a new "home on secure nfs" test case.  I note there are some NFS test cases under https://fedoraproject.org/wiki/Category:NFS_Test_Cases, and there's also https://fedoraproject.org/wiki/QA:SOP_test_case_creation.

--- Additional comment from Enrico Scholz on 2014-01-21 10:42:22 EST ---

I guess you need 'strace -f'; afair, the D rpcgssd process was a child process of the main one.

--- Additional comment from Jeff Layton on 2014-01-21 11:09:21 EST ---

(cc'ing Simo in case he has insight)

Correct. Current upstream rpc.gssd fork()s before handling the upcall so you'll want to use strace with '-f'.

I don't think gssd does anything on its own to access /home, but some of the krb5/gssapi library routines may have started doing so.

--- Additional comment from Simo Sorce on 2014-01-21 11:29:04 EST ---

Jeff,
I think libkrb5 always had code to check some files in the user's local home directory, for example the .k5login file, although that should not be used by gssapi.

....


Actually, reading the code, I guess one of the ccache plugins may try to read the .k5identity file in the user's home directory.

The user's home is determined by either the HOME env variable or by getpwuid() using geteuid().

Enrico, does the bug stop happening if you set HOME=/ in the rpc.gssd environment before running it (for example, by setting it in /etc/sysconfig/nfs)?
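The lookup order described above can be illustrated with a short sketch. This is not libkrb5's code; it only mimics the described order (HOME first, then the passwd entry for the effective UID), and the function names `resolve_home` and `k5identity_path` are hypothetical.

```python
import os
import pwd


def resolve_home(env=None):
    """Mimic the described order: $HOME first, then the passwd entry.

    Illustration only; this is not how libkrb5 is actually implemented.
    """
    env = os.environ if env is None else env
    home = env.get("HOME")
    if home:
        return home
    # Fall back to the passwd database entry for the effective UID.
    return pwd.getpwuid(os.geteuid()).pw_dir


def k5identity_path(env=None):
    """Path that a ccache-selection plugin might probe, per the comment above."""
    return os.path.join(resolve_home(env), ".k5identity")


print(k5identity_path({"HOME": "/"}))                # /.k5identity
print(k5identity_path({"HOME": "/nfs/home/user"}))   # /nfs/home/user/.k5identity
```

With HOME pointing at a local directory such as `/`, the probed path no longer lives on the kerberized NFS mount, which is why setting HOME is expected to sidestep the hang.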

--- Additional comment from nmorey on 2014-01-21 12:25:56 EST ---

I couldn't attach strace directly, as rpc.gssd got stuck when kdm started
(which seems to open a session for my user before showing the login list?).
So I added an strace invocation to the service file and rebooted.

strace reported this:
1150  open("/nfs/home/nmorey/.k5identity", O_RDONLY
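Finding such a line in a long strace log can be automated. The sketch below assumes output in the PID-prefixed format shown above; the helper name `opens_under` and the `/nfs/home/` prefix are illustrative choices, and the second sample line is a completed call included for contrast.

```python
import re

# Matches lines like:  1150  open("/nfs/home/nmorey/.k5identity", O_RDONLY
# The call may be unfinished (no closing paren or result) if the process
# blocked inside it, so only the start of the line is matched.
OPEN_RE = re.compile(
    r'^\s*(?P<pid>\d+)\s+open(?:at)?\((?:[A-Z_]+,\s*)?"(?P<path>[^"]+)"'
)


def opens_under(strace_output, prefix):
    """Yield (pid, path) for each open()/openat() whose path starts with prefix."""
    for line in strace_output.splitlines():
        m = OPEN_RE.match(line)
        if m and m.group("path").startswith(prefix):
            yield int(m.group("pid")), m.group("path")


SAMPLE = '''\
1150  open("/nfs/home/nmorey/.k5identity", O_RDONLY
2084  open("/.k5identity", O_RDONLY)    = -1 ENOENT (No such file or directory)
'''

print(list(opens_under(SAMPLE, "/nfs/home/")))
# [(1150, '/nfs/home/nmorey/.k5identity')]
```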

--- Additional comment from Jeff Layton on 2014-01-21 12:39:36 EST ---

That certainly looks like the culprit.

Setting $HOME looks like it should work around it. I guess we could add a setenv() call to gssd. Is there a better way to deal with this?

--- Additional comment from nmorey on 2014-01-21 12:43:57 EST ---

Is it actually needed to open this file?
I quickly googled it but I don't know if it is supposed to be read here at this precise time.

If it is not, there is no point in opening it.
If it is, it means that it would be required to check if the home dir is actually over NFS4+KRB to decide whether or not to use it?

--- Additional comment from Simo Sorce on 2014-01-21 12:44:13 EST ---

Jeff,
I think we definitely should add a setenv() call to gssd unconditionally.
I guess pointing it at '/' would be fine.

--- Additional comment from Simo Sorce on 2014-01-21 12:45:28 EST ---

(In reply to nmorey from comment #12)
> Is it actually needed to open this file?
> I quickly googled it but I don't know if it is supposed to be read here at
> this precise time.
> 
> If it is not, there is no point in opening it.
> If it is, it means that it would be required to check if the home dir is
> actually over NFS4+KRB to decide whether or not to use it?

I do not think we need it; we shouldn't trust user files in rpc.gssd anyway.

--- Additional comment from Jeff Layton on 2014-01-21 13:32:01 EST ---

This patch makes gssd set $HOME. Before we go with it though, it would be helpful to have either of the people able to reproduce this run gssd with $HOME set to '/'. You should be able to just log in as root and run it that way by hand.

    # systemctl stop nfs-secure.service
    # export HOME='/'
    # rpc.gssd

If you do that, does it work around the problem?
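The manual test above amounts to starting a process with HOME overridden in its environment. A minimal sketch of that idea follows; it uses a stand-in child process rather than rpc.gssd itself, the helper name `run_with_home` is hypothetical, and it is not the patch referred to in this comment.

```python
import os
import subprocess
import sys


def run_with_home(argv, home="/"):
    """Run argv with HOME overridden in the child's environment only."""
    env = dict(os.environ)
    env["HOME"] = home
    return subprocess.run(argv, env=env, capture_output=True, text=True, check=True)


# Stand-in child process: print the HOME it sees. With the real daemon, argv
# would be something like ["rpc.gssd"] run as root, which is not done here.
child = [sys.executable, "-c", "import os; print(os.environ['HOME'])"]
result = run_with_home(child)
print(result.stdout.strip())  # /
```

The parent's own environment is left untouched, so only the launched process sees the overridden value.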

--- Additional comment from nmorey on 2014-01-22 03:03:39 EST ---

The command works, but I'm not sure why. When I tried testing with strace, at some point (after restarting nfs-secure and running rpc.gssd manually) it started working.

HOME is by default /root (I'm not sure what systemctl sets it to).
But the file it opens is in my user home dir, so it is definitely not the HOME used when the top service was started.

If the process is forked and switches to the user using setresuid, who sets $HOME to the user's true home?

--- Additional comment from Enrico Scholz on 2014-01-22 05:28:05 EST ---

for me, after

  echo HOME=/ >> /etc/sysconfig/nfs

and rebooting, things seem to work fine.  I see an

2084  open("/.k5identity", O_RDONLY)    = -1 ENOENT (No such file or directory)

which is expected after this change.  Thanks for solving this.

--- Additional comment from Jeff Layton on 2014-01-22 07:54:31 EST ---

Great, thanks for testing it. Could you try this test package as well and let me know if it also fixes the problem without setting $HOME first?

    http://koji.fedoraproject.org/koji/taskinfo?taskID=6439019

If so, I'll propose this patch upstream in the next day or so.

--- Additional comment from Jeff Layton on 2014-01-22 08:00:46 EST ---

Fixed patch. The other needed a #include added and a fixed printerr() call.

--- Additional comment from Michael Young on 2014-01-22 09:43:57 EST ---

(In reply to Jeff Layton from comment #18)
> Great, thanks for testing it. Could you try this test package as well and
> let me know if it also fixes the problem without setting $HOME first?
> 
>     http://koji.fedoraproject.org/koji/taskinfo?taskID=6439019
> 
> If so, I'll propose this patch upstream in the next day or so.

That test package works for me without $HOME set.

Comment 1 Jeff Layton 2014-01-22 16:29:47 UTC
Needs commit 2f682f25c from mainline nfs-utils.

Comment 5 Jeff Layton 2014-03-13 10:45:08 UTC
Hmm. I never actually tried to reproduce this myself.

I think a reproducer would be to set up an account that has its home directory on a kerberized NFS mount, and then do something like:

    root# su testuser
    testuser$ kinit

...it seems like that should deadlock with the old package, but I haven't tested it to be sure.

Comment 6 JianHong Yin 2014-03-14 02:55:59 UTC
https://beaker.engineering.redhat.com/jobs/611731

Comment 7 Ludek Smid 2014-06-13 09:59:48 UTC
This request was resolved in Red Hat Enterprise Linux 7.0.

Contact your manager or support representative in case you have further questions about the request.

