Bug 1934866 - NFS client can hang when mounting with GSS auth against a server that doesn't have GSS configured
Summary: NFS client can hang when mounting with GSS auth against a server that doesn't...
Alias: None
Product: Red Hat Enterprise Linux 8
Classification: Red Hat
Component: libtirpc
Version: 8.4
Hardware: Unspecified
OS: Unspecified
Target Milestone: rc
: 8.5
Assignee: Steve Dickson
QA Contact: Zhi Li
Depends On:
Blocks: 1947058
Reported: 2021-03-03 22:02 UTC by Scott Mayhew
Modified: 2021-11-10 07:17 UTC (History)
2 users

Fixed In Version: libtirpc-1.1.4-5.el8
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1947058 (view as bug list)
Last Closed: 2021-11-09 19:33:10 UTC
Type: Bug
Target Upstream Version:

Attachments (Terms of Use)
libtirpc patch to disallow calling auth_refresh from clnt_call with RPCSEC_GSS (2.59 KB, patch)
2021-03-03 22:04 UTC, Scott Mayhew

System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2021:4375 0 None None None 2021-11-09 19:33:14 UTC

Description Scott Mayhew 2021-03-03 22:02:15 UTC
Description of problem:

At the Virtual Bakeathon a number of participants had difficulty mounting certain servers.  These were all non-knfsd servers (Hammerspace, FreeBSD, and initially NFS-Ganesha) that didn't have GSS configured.  In all cases the client would attempt an RPCSEC_GSS_INIT, the server would reply with an AUTH_ERROR, and the client would become stuck.

Version-Release number of selected component (if applicable):

How reproducible:

This can be reproduced somewhat easily, but you need a non-knfsd NFS server (I'm using a Hammerspace server).

Steps to Reproduce:
1. Add an nfs service principal for the NFS server to the KDC, but don't configure GSS on the NFS server itself (i.e. don't install a keytab and don't enable sec=krb5{,i,p} in the exports)
2. Attempt to mount from a client

Actual results:

The client becomes hung.

The kernel stack for mount.nfs looks like this:

[<0>] gss_cred_init+0x2b2/0x330 [auth_rpcgss]
[<0>] rpcauth_lookup_credcache+0x2b1/0x340 [sunrpc]
[<0>] rpcauth_refreshcred+0x25b/0x2a0 [sunrpc]
[<0>] __rpc_execute+0x85/0x3c0 [sunrpc]
[<0>] rpc_run_task+0x144/0x190 [sunrpc]
[<0>] nfs4_run_exchange_id+0x1f5/0x290 [nfsv4]
[<0>] _nfs4_proc_exchange_id+0x34/0x400 [nfsv4]
[<0>] nfs4_proc_exchange_id+0x3e/0x50 [nfsv4]
[<0>] nfs41_discover_server_trunking+0x1c/0xa0 [nfsv4]
[<0>] nfs4_discover_server_trunking+0x72/0x240 [nfsv4]
[<0>] nfs4_init_client+0x16e/0x240 [nfsv4]
[<0>] nfs4_set_client+0xe9/0x140 [nfsv4]
[<0>] nfs4_create_server+0x12b/0x2c0 [nfsv4]
[<0>] nfs4_try_get_tree+0x33/0xb0 [nfsv4]
[<0>] vfs_get_tree+0x25/0xb0
[<0>] do_mount+0x2e2/0x950
[<0>] ksys_mount+0xb6/0xd0
[<0>] __x64_sys_mount+0x21/0x30
[<0>] do_syscall_64+0x5b/0x1a0
[<0>] entry_SYSCALL_64_after_hwframe+0x65/0xca

The userspace stack for one of the rpc.gssd threads looks like this:

Thread 2 (Thread 0x7f05639d8700 (LWP 66844)):
#0  futex_wait_cancelable (private=0, expected=0, futex_word=0x7f0561da50c8) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
#1  __pthread_cond_wait_common (abstime=0x0, mutex=0x7f0565a0a900 <clnt_fd_lock>, cond=0x7f0561da50a0) at pthread_cond_wait.c:502
#2  __pthread_cond_wait (cond=0x7f0561da50a0, mutex=mutex@entry=0x7f0565a0a900 <clnt_fd_lock>) at pthread_cond_wait.c:655
#3  0x00007f05657e7b2f in clnt_vc_call (cl=0x7f055c0081f0, proc=0, xdr_args=0x7f05657fa520 <xdr_rpc_gss_init_args>, args_ptr=0x7f05639d7190, xdr_results=0x7f05657fa590 <xdr_rpc_gss_init_res>, results_ptr=0x7f05639d71c0, timeout=...) at clnt_vc.c:369
#4  0x00007f05657f94ca in _rpc_gss_refresh (auth=0x7f055c009130, options_ret=options_ret@entry=0x7f05639d7240) at auth_gss.c:517
#5  0x00007f05657f9837 in authgss_refresh (auth=<optimized out>, dummy=<optimized out>) at auth_gss.c:627
#6  0x00007f05657e7d8c in clnt_vc_call (cl=0x7f055c0081f0, proc=0, xdr_args=0x7f05657fa520 <xdr_rpc_gss_init_args>, args_ptr=0x7f05639d75a0, xdr_results=0x7f05657fa590 <xdr_rpc_gss_init_res>, results_ptr=0x7f05639d75d0, timeout=...) at clnt_vc.c:461
#7  0x00007f05657f94ca in _rpc_gss_refresh (auth=auth@entry=0x7f055c009130, options_ret=options_ret@entry=0x7f05639d7660) at auth_gss.c:517
#8  0x00007f05657f9989 in authgss_refresh (dummy=0x0, auth=0x7f055c009130) at auth_gss.c:627
#9  authgss_create (clnt=clnt@entry=0x7f055c0081f0, name=0x7f055c005f60, sec=sec@entry=0x7f05639d77f0) at auth_gss.c:232
#10 0x00007f05657f9b4e in authgss_create_default (clnt=clnt@entry=0x7f055c0081f0, service=service@entry=0x55a731a4fd30 "nfs.lab.eng.bos.redhat.com", sec=sec@entry=0x7f05639d77f0) at auth_gss.c:267
#11 0x000055a73153e38f in create_auth_rpc_client (clp=clp@entry=0x55a731a53980, tgtname=0x55a731a4fd30 "nfs.lab.eng.bos.redhat.com", tgtname@entry=0x0, clnt_return=clnt_return@entry=0x7f05639d7cb0, auth_return=auth_return@entry=0x7f05639d7cc0, uid=uid@entry=0, cred=cred@entry=0x0, authtype=0) at gssd_proc.c:394
#12 0x000055a73153e6b3 in krb5_use_machine_creds (uid=0, rpc_clnt=0x7f05639d7cb0, service=0x55a731a61350 "*", tgtname=0x0, srchost=0x0, clp=0x55a731a53980) at gssd_proc.c:570
#13 process_krb5_upcall (clp=clp@entry=0x55a731a53980, uid=uid@entry=0, fd=10, srchost=srchost@entry=0x0, tgtname=tgtname@entry=0x0, service=service@entry=0x55a731a61350 "*") at gssd_proc.c:658
#14 0x000055a73153f2c5 in handle_gssd_upcall (info=0x55a731a61330) at gssd_proc.c:822
#15 0x00007f05655c014a in start_thread (arg=<optimized out>) at pthread_create.c:479
#16 0x00007f05650d8db3 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Note that the thread is in clnt_vc_call() in both frames #3 and #6.  It's self-deadlocked waiting on a condition variable.

Expected results:

If no sec= option was passed to the mount command, then the client should have fallen back to mounting with sec=sys.  Or, if one of sec=krb5/sec=krb5i/sec=krb5p was passed to the mount command, then the mount command should fail with 'mount.nfs: Protocol not supported'.

Additional info:

If you set use-gss-proxy=0 in the [gssd] section of /etc/nfs.conf and restart rpc.gssd, then the mount works.

Likewise, if you add either Environment=GSSPROXY_BEHAVIOR=REMOTE_ONLY or Environment=GSSPROXY_BEHAVIOR=LOCAL_ONLY to the systemd unit file for rpc.gssd and restart rpc.gssd, the mount works.
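For reference, the two workarounds above correspond to config fragments like the following (the systemd unit name and drop-in path for rpc.gssd are assumptions, not taken from this report):

```ini
# /etc/nfs.conf -- make rpc.gssd stop using gssproxy
[gssd]
use-gss-proxy=0

# Alternatively, a systemd drop-in for the rpc.gssd unit (assumed to be
# rpc-gssd.service), e.g. /etc/systemd/system/rpc-gssd.service.d/gssproxy.conf
[Service]
Environment=GSSPROXY_BEHAVIOR=REMOTE_ONLY
```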

So it seems there could be a bug in gssproxy (more on that later).

But the main bug seems to be in libtirpc.  Both clnt_vc_call() and clnt_dg_call() contain logic that, when the call status is not RPC_SUCCESS, attempts to refresh our credentials and retries the call.  But for GSS auth, refreshing our credentials actually requires making an RPC call (the NULL call with the RPCSEC_GSS_INIT control procedure)!  So we wind up recursing into clnt_vc_call() (for tcp) or clnt_dg_call() (for udp), where we self-deadlock in cond_wait().  I have a patch that fixes that, but I haven't sent it upstream yet because we seem to be having issues with vger.kernel.org mail at the moment.

Comment 1 Scott Mayhew 2021-03-03 22:04:22 UTC
Created attachment 1760515 [details]
libtirpc patch to disallow calling auth_refresh from clnt_call with RPCSEC_GSS

Comment 2 Scott Mayhew 2021-03-04 18:56:42 UTC
In the bug description I mentioned that while I think this is mainly a libtirpc issue, I'm wondering if there's a gssproxy issue as well.  Here's an attempt to describe what I see.

Here's where rpc.gssd is handling an upcall from the kernel:

create_auth_rpc_client (rpc.gssd)
  authgss_create_default (libtirpc)
          gss_init_sec_context (libgssapi_krb5)

We then make a NULL call to the NFS server with the procedure field in the authgss credential set to RPCSEC_GSS_INIT and the output token from the gss_init_sec_context() call in the NFS payload.

                        call_stat = clnt_call(gd->clnt, NULLPROC,
                                              (caddr_t)&gr, AUTH_TIMEOUT);

The NFS server replies with RPC_AUTHERROR since it doesn't have GSS auth enabled, and we hit this logic in clnt_vc_call():

        if (ct->ct_error.re_status == RPC_SUCCESS) {
                ...
        }  /* end successful completion */
        else {
                /* maybe our credentials need to be refreshed ... */
                if (refreshes-- && AUTH_REFRESH(cl->cl_auth, &reply_msg))
                        goto call_again;
        }  /* end of unsuccessful completion */

AUTH_REFRESH is just a macro that calls the ah_refresh() auth op, which in our case is authgss_refresh().  So we call authgss_refresh(), which calls down through _rpc_gss_refresh() into gss_init_sec_context() again.

This time we have an existing context handle from the first gss_init_sec_context() call.

When rpc.gssd is not using gssproxy, this call to gss_init_sec_context() goes straight down into libgssapi_krb5's krb5_gss_init_sec_context(), which returns GSS_S_DEFECTIVE_TOKEN.  That gets returned back up to _rpc_gss_refresh(), which further returns back up to create_auth_rpc_client() instead of diving back down into clnt_vc_call(), and no self-deadlock occurs.

Contrast that with the scenario where rpc.gssd is using gssproxy.  In that case the call to gss_init_sec_context() goes through gssproxy's interposer, gssi_init_sec_context(), again with the existing context handle from the first gss_init_sec_context() call.

gssproxy executes the request and returns GSS_S_DEFECTIVE_TOKEN from mutual_auth(), the same as in the non-gssproxy case.

HOWEVER, then we hit this logic in gssi_init_sec_context():

        if (behavior == GPP_REMOTE_FIRST) {
            /* So remote failed, but we can fallback to local, try that */
            maj = init_ctx_local(&min, cred_handle, ctx_handle, name,
                                 mech_type, req_flags, time_req, input_cb,
                                 input_token, actual_mech_type, output_token,
                                 ret_flags, time_rec);

and wind up calling down into krb5_gss_init_sec_context() again, but this time it's without the existing context handle from the first gss_init_sec_context() call.

So we return back up to _rpc_gss_refresh() with a status of GSS_S_CONTINUE_NEEDED and a new context handle, and _rpc_gss_refresh() calls back down into clnt_vc_call() where we self-deadlock in cond_wait().

So I guess the main question is, is the behavior from init_ctx_local() intended/correct?  i.e. is it supposed to be using a new context handle or the existing context handle?

Should we be running rpc.gssd with GSSPROXY_BEHAVIOR=REMOTE_ONLY to avoid the local mech?

Comment 3 Robbie Harwood 2021-03-10 23:18:09 UTC
Thank you for that level of detail!  It's very helpful.

> Should we be running rpc.gssd with GSSPROXY_BEHAVIOR=REMOTE_ONLY to avoid the local mech?

As framed, yes: if you don't want local, then set only remote.  (This is not my final answer.)

But I'm not sure that bypassing local is right here.  Since it has a status of CONTINUE_NEEDED, that means GSSAPI thinks negotiation could succeed.  Is this a problem for reasons other than the deadlock?

Comment 4 Steve Dickson 2021-04-07 14:43:15 UTC
commit 9359a704e50df111d4646c2233076dd56d545489 (tag: libtirpc-1-3-2-rc1, origin/master, origin/HEAD)
Author: Scott Mayhew <smayhew>
Date:   Mon Mar 15 16:46:12 2021 -0400

    libtirpc: disallow calling auth_refresh from clnt_call with RPCSEC_GSS

Comment 11 Zhi Li 2021-05-06 08:13:28 UTC
I have finished regression tests with libtirpc-1.1.4-5.el8 and found no new issues.

I'm verifying this as sanity only for now.

Comment 13 errata-xmlrpc 2021-11-09 19:33:10 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (libtirpc bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

