Description of problem:

At the Virtual Bakeathon a number of participants had difficulty mounting certain servers. These were all non-knfsd servers (Hammerspace, FreeBSD, and initially NFS-Ganesha) that didn't have GSS configured. In every case the client would attempt an RPCSEC_GSS_INIT, the server would reply with an AUTH_ERROR, and the client would become stuck.

Version-Release number of selected component (if applicable):
libtirpc-1.1.4-4.el8.x86_64

How reproducible:
This can be reproduced fairly easily, but you need a non-knfsd NFS server (I'm using a Hammerspace server).

Steps to Reproduce:
1. Add an nfs service principal for the NFS server to the KDC, but don't configure GSS on the NFS server itself (i.e. don't install a keytab and don't enable sec=krb5{,i,p} in the exports).
2. Attempt to mount from a client.

Actual results:
The client hangs. The kernel stack for mount.nfs looks like this:

[<0>] gss_cred_init+0x2b2/0x330 [auth_rpcgss]
[<0>] rpcauth_lookup_credcache+0x2b1/0x340 [sunrpc]
[<0>] rpcauth_refreshcred+0x25b/0x2a0 [sunrpc]
[<0>] __rpc_execute+0x85/0x3c0 [sunrpc]
[<0>] rpc_run_task+0x144/0x190 [sunrpc]
[<0>] nfs4_run_exchange_id+0x1f5/0x290 [nfsv4]
[<0>] _nfs4_proc_exchange_id+0x34/0x400 [nfsv4]
[<0>] nfs4_proc_exchange_id+0x3e/0x50 [nfsv4]
[<0>] nfs41_discover_server_trunking+0x1c/0xa0 [nfsv4]
[<0>] nfs4_discover_server_trunking+0x72/0x240 [nfsv4]
[<0>] nfs4_init_client+0x16e/0x240 [nfsv4]
[<0>] nfs4_set_client+0xe9/0x140 [nfsv4]
[<0>] nfs4_create_server+0x12b/0x2c0 [nfsv4]
[<0>] nfs4_try_get_tree+0x33/0xb0 [nfsv4]
[<0>] vfs_get_tree+0x25/0xb0
[<0>] do_mount+0x2e2/0x950
[<0>] ksys_mount+0xb6/0xd0
[<0>] __x64_sys_mount+0x21/0x30
[<0>] do_syscall_64+0x5b/0x1a0
[<0>] entry_SYSCALL_64_after_hwframe+0x65/0xca

The userspace stack for one of the rpc.gssd threads looks like this:

Thread 2 (Thread 0x7f05639d8700 (LWP 66844)):
#0  futex_wait_cancelable (private=0, expected=0, futex_word=0x7f0561da50c8) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
#1  __pthread_cond_wait_common (abstime=0x0, mutex=0x7f0565a0a900 <clnt_fd_lock>, cond=0x7f0561da50a0) at pthread_cond_wait.c:502
#2  __pthread_cond_wait (cond=0x7f0561da50a0, mutex=mutex@entry=0x7f0565a0a900 <clnt_fd_lock>) at pthread_cond_wait.c:655
#3  0x00007f05657e7b2f in clnt_vc_call (cl=0x7f055c0081f0, proc=0, xdr_args=0x7f05657fa520 <xdr_rpc_gss_init_args>, args_ptr=0x7f05639d7190, xdr_results=0x7f05657fa590 <xdr_rpc_gss_init_res>, results_ptr=0x7f05639d71c0, timeout=...) at clnt_vc.c:369
#4  0x00007f05657f94ca in _rpc_gss_refresh (auth=0x7f055c009130, options_ret=options_ret@entry=0x7f05639d7240) at auth_gss.c:517
#5  0x00007f05657f9837 in authgss_refresh (auth=<optimized out>, dummy=<optimized out>) at auth_gss.c:627
#6  0x00007f05657e7d8c in clnt_vc_call (cl=0x7f055c0081f0, proc=0, xdr_args=0x7f05657fa520 <xdr_rpc_gss_init_args>, args_ptr=0x7f05639d75a0, xdr_results=0x7f05657fa590 <xdr_rpc_gss_init_res>, results_ptr=0x7f05639d75d0, timeout=...) at clnt_vc.c:461
#7  0x00007f05657f94ca in _rpc_gss_refresh (auth=auth@entry=0x7f055c009130, options_ret=options_ret@entry=0x7f05639d7660) at auth_gss.c:517
#8  0x00007f05657f9989 in authgss_refresh (dummy=0x0, auth=0x7f055c009130) at auth_gss.c:627
#9  authgss_create (clnt=clnt@entry=0x7f055c0081f0, name=0x7f055c005f60, sec=sec@entry=0x7f05639d77f0) at auth_gss.c:232
#10 0x00007f05657f9b4e in authgss_create_default (clnt=clnt@entry=0x7f055c0081f0, service=service@entry=0x55a731a4fd30 "nfs.lab.eng.bos.redhat.com", sec=sec@entry=0x7f05639d77f0) at auth_gss.c:267
#11 0x000055a73153e38f in create_auth_rpc_client (clp=clp@entry=0x55a731a53980, tgtname=0x55a731a4fd30 "nfs.lab.eng.bos.redhat.com", tgtname@entry=0x0, clnt_return=clnt_return@entry=0x7f05639d7cb0, auth_return=auth_return@entry=0x7f05639d7cc0, uid=uid@entry=0, cred=cred@entry=0x0, authtype=0) at gssd_proc.c:394
#12 0x000055a73153e6b3 in krb5_use_machine_creds (uid=0, rpc_clnt=0x7f05639d7cb0, service=0x55a731a61350 "*", tgtname=0x0, srchost=0x0, clp=0x55a731a53980) at gssd_proc.c:570
#13 process_krb5_upcall (clp=clp@entry=0x55a731a53980, uid=uid@entry=0, fd=10, srchost=srchost@entry=0x0, tgtname=tgtname@entry=0x0, service=service@entry=0x55a731a61350 "*") at gssd_proc.c:658
#14 0x000055a73153f2c5 in handle_gssd_upcall (info=0x55a731a61330) at gssd_proc.c:822
#15 0x00007f05655c014a in start_thread (arg=<optimized out>) at pthread_create.c:479
#16 0x00007f05650d8db3 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Note that the thread is in clnt_vc_call() in both frame #3 and frame #6. It has self-deadlocked waiting on a condition variable.

Expected results:
If no sec= option was passed to the mount command, the client should have fallen back to mounting with sec=sys. If sec=krb5, sec=krb5i, or sec=krb5p was passed to the mount command, the mount should have failed with 'mount.nfs: Protocol not supported'.

Additional info:
If you set use-gss-proxy=0 in the [gssd] section of /etc/nfs.conf and restart rpc.gssd, the mount works. Likewise, if you add either Environment=GSSPROXY_BEHAVIOR=REMOTE_ONLY or Environment=GSSPROXY_BEHAVIOR=LOCAL_ONLY to the systemd unit file for rpc.gssd and restart rpc.gssd, the mount works. So there may be a bug in gssproxy as well (more on that later).

The main bug, however, appears to be in libtirpc. Both clnt_vc_call() and clnt_dg_call() have logic that says: if the call status is not RPC_SUCCESS, try to refresh our credentials and make the call again. But for GSS auth, refreshing our credentials requires making an RPC call (the NULL call with the RPCSEC_GSS_INIT control procedure)! So we recurse into clnt_vc_call() (for tcp) or clnt_dg_call() (for udp), where we self-deadlock in cond_wait(). I have a patch that fixes that, but I haven't sent it upstream yet because we seem to be having issues with vger.kernel.org mail at the moment.
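For illustration, the shape of the fix is roughly the following guard in the unsuccessful-completion path of clnt_vc_call() (and similarly clnt_dg_call()). This is only a sketch of the idea, not the attached patch itself; auth_is_rpcsec_gss() is a hypothetical predicate standing in for however the real patch detects a GSS auth handle:

	else {
		/* maybe our credentials need to be refreshed ... */
		/* Sketch only: skip the refresh entirely for RPCSEC_GSS,
		 * because refreshing a GSS context issues another clnt_call()
		 * on this client, which would re-enter clnt_vc_call() and
		 * deadlock on clnt_fd_lock. */
		if (refreshes-- && !auth_is_rpcsec_gss(cl->cl_auth) &&
		    AUTH_REFRESH(cl->cl_auth, &reply_msg))
			goto call_again;
	}	/* end of unsuccessful completion */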
Created attachment 1760515 [details] libtirpc patch to disallow calling auth_refresh from clnt_call with RPCSEC_GSS
In the bug description I mentioned that while I think this is mainly a libtirpc issue, I'm wondering if there's a gssproxy issue as well. Here's an attempt to describe what I see.

rpc.gssd handles an upcall from the kernel:

	create_auth_rpc_client (rpc.gssd)
	  authgss_create_default (libtirpc)
	    authgss_create
	      authgss_refresh
	        _rpc_gss_refresh
	          gss_init_sec_context (libgssapi_krb5)

We then make a NULL call to the NFS server with the procedure field in the authgss credential set to RPCSEC_GSS_INIT and the output token from the gss_init_sec_context() call in the NFS payload:

	call_stat = clnt_call(gd->clnt, NULLPROC,
			      (xdrproc_t)xdr_rpc_gss_init_args, &send_token,
			      (xdrproc_t)xdr_rpc_gss_init_res, (caddr_t)&gr,
			      AUTH_TIMEOUT);

The NFS server replies with RPC_AUTHERROR since it doesn't have GSS auth enabled, and we hit this logic in clnt_vc_call():

	if (ct->ct_error.re_status == RPC_SUCCESS) {
		...
	else {
		/* maybe our credentials need to be refreshed ... */
		if (refreshes-- && AUTH_REFRESH(cl->cl_auth, &reply_msg))
			goto call_again;
	}	/* end of unsuccessful completion */

AUTH_REFRESH is just a macro that calls the ah_refresh() auth op, which in our case is authgss_refresh(). So we call:

	authgss_refresh
	  _rpc_gss_refresh
	    gss_init_sec_context

This time we have an existing context handle from the first gss_init_sec_context() call. When rpc.gssd is not using gssproxy, the call chain for this call to gss_init_sec_context() is:

	gss_init_sec_context
	  mech->gss_init_sec_context
	    krb5_gss_init_sec_context
	      krb5_gss_init_sec_context_ext
	        mutual_auth

which returns GSS_S_DEFECTIVE_TOKEN. That status propagates back up to _rpc_gss_refresh(), which returns back up to create_auth_rpc_client() instead of diving back down into clnt_vc_call(), and no self-deadlock occurs.

Contrast that with the scenario where rpc.gssd is using gssproxy. In that case, the call chain for the call to gss_init_sec_context() is:

	gss_init_sec_context
	  mech->gss_init_sec_context
	    gssi_init_sec_context
	      gpm_init_sec_context
	        gpm_make_call

again with the existing context handle from the first gss_init_sec_context() call. gssproxy executes the request and returns GSS_S_DEFECTIVE_TOKEN from mutual_auth(), the same as in the non-gssproxy case. HOWEVER, we then hit this logic in gssi_init_sec_context():

	if (behavior == GPP_REMOTE_FIRST) {
	    /* So remote failed, but we can fallback to local, try that */
	    maj = init_ctx_local(&min, cred_handle, ctx_handle, name,
				 mech_type, req_flags, time_req, input_cb,
				 input_token, actual_mech_type, output_token,
				 ret_flags, time_rec);
	}

and wind up calling down into krb5_gss_init_sec_context() again, but this time without the existing context handle from the first gss_init_sec_context() call. So we return back up to _rpc_gss_refresh() with a status of GSS_S_CONTINUE_NEEDED and a new context handle, and _rpc_gss_refresh() calls back down into clnt_vc_call(), where we self-deadlock in cond_wait().

So I guess the main question is: is the behavior of init_ctx_local() intended/correct? That is, is it supposed to be using a new context handle or the existing context handle? Should we be running rpc.gssd with GSSPROXY_BEHAVIOR=REMOTE_ONLY to avoid the local mech?
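For reference, here is a condensed paraphrase of the branch in _rpc_gss_refresh() that makes the difference between the two scenarios. It is reconstructed from the call chains above, not the literal auth_gss.c source, and elides the surrounding setup:

	/* Condensed paraphrase of _rpc_gss_refresh(), not the exact source. */
	maj_stat = gss_init_sec_context(..., &gd->ctx, ..., &send_token, ...);

	if (maj_stat != GSS_S_COMPLETE && maj_stat != GSS_S_CONTINUE_NEEDED)
		return FALSE;	/* e.g. GSS_S_DEFECTIVE_TOKEN: the error just
				 * propagates back to rpc.gssd, no deadlock */

	if (send_token.length != 0)
		/* GSS_S_CONTINUE_NEEDED with a fresh token: send it to the
		 * server with another NULL call.  This is the re-entry into
		 * clnt_vc_call() that self-deadlocks on clnt_fd_lock. */
		call_stat = clnt_call(gd->clnt, NULLPROC,
				      (xdrproc_t)xdr_rpc_gss_init_args, &send_token,
				      (xdrproc_t)xdr_rpc_gss_init_res, (caddr_t)&gr,
				      AUTH_TIMEOUT);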
Thank you for that level of detail! It's very helpful.

> Should we be running rpc.gssd with GSSPROXY_BEHAVIOR=REMOTE_ONLY to avoid the local mech?

As framed, yes: if you don't want the local mech, then set only remote. (This is not my final answer.) But I'm not sure that bypassing local is right here. Since the local mech returns GSS_S_CONTINUE_NEEDED, GSSAPI thinks the negotiation could still succeed. Is this a problem for reasons other than the deadlock?
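For anyone who wants to try the remote-only behavior in the meantime, the override would look roughly like this (a sketch; the rpc-gssd.service unit name and drop-in path are assumed from a standard RHEL 8 setup, adjust as needed):

	# /etc/systemd/system/rpc-gssd.service.d/gssproxy.conf
	[Service]
	Environment=GSSPROXY_BEHAVIOR=REMOTE_ONLY

	# then pick up the drop-in and restart the daemon:
	systemctl daemon-reload
	systemctl restart rpc-gssd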
commit 9359a704e50df111d4646c2233076dd56d545489 (tag: libtirpc-1-3-2-rc1, origin/master, origin/HEAD)
Author: Scott Mayhew <smayhew>
Date:   Mon Mar 15 16:46:12 2021 -0400

    libtirpc: disallow calling auth_refresh from clnt_call with RPCSEC_GSS
Regression tests are finished and no new issues were found with libtirpc-1.1.4-5.el8. I'm verifying this as sanity-only for now.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (libtirpc bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:4375