Bug 1419280

Summary: Non-thread-safe functions used in multi-threaded rpc.gssd
Product: Red Hat Enterprise Linux 7 Reporter: Frank Sorenson <fsorenso>
Component: nfs-utilsAssignee: Steve Dickson <steved>
Status: CLOSED ERRATA QA Contact: ChunYu Wang <chunwang>
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: 7.3CC: bruno.travouillon, chunwang, fsorenso, jiyin, yoyang
Target Milestone: rcKeywords: Reproducer
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: nfs-utils-1.3.0-0.36.el7 Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2017-08-01 19:48:51 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
program to stat a path as multiple euids simultaneously
none
script to launch the reproducer and check for its success
none
patch to replace strtok() with strtok_r() and duplicate input string none

Description Frank Sorenson 2017-02-04 19:23:32 UTC
Description of problem:

The handle_gssd_upcall() function in rpc.gssd makes calls to the non-thread-safe strtok() library function, leading to process hangs, crashes, or incorrect behavior.  In addition, the strtok() function modifies the input string, so when errors occur, the error messages are incomplete.


Version-Release number of selected component (if applicable):

nfs-utils-1.3.0-0.33.el7.x86_64
latest nfs-utils git tree


How reproducible:

very.  reproducer attached
also, issue can be seen through code inspection


Steps to Reproduce:

reproducer attached

1) set up nfs mount with sec=krb5 (any flavor)
2) access a file or directory on the mount with multiple uids simultaneously
3) each will trigger an upcall to rpc.gssd
4) if more than one of these upcalls hits handle_gssd_upcall() simultaneously, the strtok() function's internal state may become modified by one thread, leading to incorrect results in the other threads


the attached reproducer forks multiple threads, then causes the processes to call stat() on a path simultaneously, each running as a different euid.


Actual results:

the strtok() function is non-thread-safe, so the upcall may process the input string incorrectly (or the wrong string entirely), leading to one thread operating on another thread's string.

the bug may lead to a failure to process the input string correctly, potentially crashing rpc.gssd, or resulting in a message such as:
Feb  3 12:03:58 vm1 rpc.gssd[15921]: WARNING: handle_gssd_upcall: failed to find uid in upcall string 'mech=krb5'

the processes which triggered the upcall may also hang in the upcall (in gss_cred_init), with a backtrace similar to the following:
/proc/PID/stack
[<ffffffffa00f1beb>] gss_cred_init+0x22b/0x350 [auth_rpcgss]
[<ffffffffa0324fad>] rpcauth_lookup_credcache+0x20d/0x2b0 [sunrpc]
[<ffffffffa00ef403>] gss_lookup_cred+0x13/0x20 [auth_rpcgss]
[<ffffffffa03245eb>] rpcauth_lookupcred+0x8b/0xd0 [sunrpc]
[<ffffffffa0325715>] o _refreshcred+0x145/0x1c0 [sunrpc]
[<ffffffffa0314d0c>] call_refresh+0x3c/0x60 [sunrpc]
[<ffffffffa03215f8>] __rpc_execute+0x98/0x440 [sunrpc]
[<ffffffffa0323e78>] rpc_execute+0x68/0xc0 [sunrpc]
[<ffffffffa0316896>] rpc_run_task+0xf6/0x140 [sunrpc]
[<ffffffffa05651f3>] nfs4_call_sync_sequence+0x63/0xa0 [nfsv4]
[<ffffffffa056605c>] _nfs4_proc_getattr+0xcc/0xf0 [nfsv4]
[<ffffffffa05724cc>] nfs4_proc_getattr+0x6c/0x100 [nfsv4]
[<ffffffffa0530ecb>] __nfs_revalidate_inode+0xcb/0x350 [nfs]
[<ffffffffa0531281>] nfs_getattr+0x81/0x210 [nfs]
[<ffffffff8122d749>] vfs_getattr_nosec+0x29/0x40
[<ffffffff8122d816>] vfs_getattr+0x26/0x30
[<ffffffff8122d8f8>] vfs_fstatat+0x78/0xc0
[<ffffffff8122de3e>] SYSC_newstat+0x2e/0x60
[<ffffffff8122e11e>] SyS_newstat+0xe/0x10
[<ffffffff816ea677>] entry_SYSCALL_64_fastpath+0x1a/0xa9
[<ffffffffffffffff>] 0xffffffffffffffff


Expected results:

code is threadsafe, with no invalid failures, crashes, or hangs


Additional info:


since the strtok() function modifies the original string, when a failure occurs, verbose logging may report the error using only a partial string from the upcall

the attached patch replaces the strtok() call with the reentrant strtok_r() function, using a per-thread variable to maintain the function's state.


I have not thoroughly reviewed the source code for other non-thread-safe function usage, so there may be more

Comment 1 Frank Sorenson 2017-02-04 19:49:24 UTC
Created attachment 1247730 [details]
program to stat a path as multiple euids simultaneously

Comment 2 Frank Sorenson 2017-02-04 19:52:03 UTC
Created attachment 1247731 [details]
script to launch the reproducer and check for its success

Comment 3 Frank Sorenson 2017-02-04 20:07:49 UTC
Created attachment 1247732 [details]
patch to replace strtok() with strtok_r() and duplicate input string

Comment 4 Frank Sorenson 2017-02-04 20:22:28 UTC
to use the reproducer, modify the path to stat_as and the target path in the bz1419280_test_threads script as needed, then run the script.

If the bug is reproduced as a hang (only one of the possibilities, but an easily-reproducible and testable outcome), the script will report success in reproducing the issue, and then exit:

# ./bz1419280_test_threads
calling stat on '/mnt/test_krb5/foo' with uids 9995 through 10005
success after 1 iterations

there will be 'stat_as' processes still blocked, as seen in the following ps output:
 5960 pts/4    D      0:00 ./stat_as /mnt/test_krb5/foo 9995 10005
 5962 pts/4    D      0:00 ./stat_as /mnt/test_krb5/foo 9995 10005
 5964 pts/4    D      0:00 ./stat_as /mnt/test_krb5/foo 9995 10005
 5965 pts/4    D      0:00 ./stat_as /mnt/test_krb5/foo 9995 10005
 5966 pts/4    D      0:00 ./stat_as /mnt/test_krb5/foo 9995 10005

and the stack will look similar to the following:
# cat /proc/5966/stack
[<ffffffffa00f1beb>] gss_cred_init+0x22b/0x350 [auth_rpcgss]
[<ffffffffa0324fad>] rpcauth_lookup_credcache+0x20d/0x2b0 [sunrpc]
[<ffffffffa00ef403>] gss_lookup_cred+0x13/0x20 [auth_rpcgss]
[<ffffffffa0325eaf>] generic_bind_cred+0x1f/0x30 [sunrpc]
[<ffffffffa0325676>] rpcauth_refreshcred+0xa6/0x1c0 [sunrpc]
[<ffffffffa0314d0c>] call_refresh+0x3c/0x60 [sunrpc]
[<ffffffffa03215f8>] __rpc_execute+0x98/0x440 [sunrpc]
[<ffffffffa0323e78>] rpc_execute+0x68/0xc0 [sunrpc]
[<ffffffffa0316896>] rpc_run_task+0xf6/0x140 [sunrpc]
[<ffffffffa05651f3>] nfs4_call_sync_sequence+0x63/0xa0 [nfsv4]
[<ffffffffa056787d>] _nfs4_proc_access+0x11d/0x1a0 [nfsv4]
[<ffffffffa05723cc>] nfs4_proc_access+0x5c/0xf0 [nfsv4]
[<ffffffffa052beeb>] nfs_do_access+0x10b/0x370 [nfs]
[<ffffffffa052c325>] nfs_permission+0x1a5/0x210 [nfs]
[<ffffffff812333a4>] __inode_permission+0x64/0xd0
[<ffffffff81233428>] inode_permission+0x18/0x50
[<ffffffff81235aab>] link_path_walk+0x30b/0x610
[<ffffffff81235eaf>] path_lookupat+0x7f/0x110
[<ffffffff8123883f>] filename_lookup+0xaf/0x190
[<ffffffff812389f6>] user_path_at_empty+0x36/0x40
[<ffffffff8122d8e6>] vfs_fstatat+0x66/0xc0
[<ffffffff8122de3e>] SYSC_newstat+0x2e/0x60
[<ffffffff8122e11e>] SyS_newstat+0xe/0x10
[<ffffffff816ea677>] entry_SYSCALL_64_fastpath+0x1a/0xa9
[<ffffffffffffffff>] 0xffffffffffffffff

less commonly, there may also be a message in the system log:
Feb  3 12:03:58 vm1 rpc.gssd[15921]: WARNING: handle_gssd_upcall: failed to find uid in upcall string 'mech=krb5'

(note that the reported upcall string only shows the first portion of the string, since strtok() has modified it)


if the bug does not reproduce as a hang, the test script will continue attempting:
# ./bz1419280_test_threads
calling stat on '/mnt/test_krb5/foo' with uids 9995 through 10005
calling stat on '/mnt/test_krb5/foo' with uids 9995 through 10005
calling stat on '/mnt/test_krb5/foo' with uids 9995 through 10005
calling stat on '/mnt/test_krb5/foo' with uids 9995 through 10005
calling stat on '/mnt/test_krb5/foo' with uids 9995 through 10005
calling stat on '/mnt/test_krb5/foo' with uids 9995 through 10005
calling stat on '/mnt/test_krb5/foo' with uids 9995 through 10005
calling stat on '/mnt/test_krb5/foo' with uids 9995 through 10005
calling stat on '/mnt/test_krb5/foo' with uids 9995 through 10005
calling stat on '/mnt/test_krb5/foo' with uids 9995 through 10005
calling stat on '/mnt/test_krb5/foo' with uids 9995 through 10005
calling stat on '/mnt/test_krb5/foo' with uids 9995 through 10005
calling stat on '/mnt/test_krb5/foo' with uids 9995 through 10005
calling stat on '/mnt/test_krb5/foo' with uids 9995 through 10005
calling stat on '/mnt/test_krb5/foo' with uids 9995 through 10005
...

Comment 5 Steve Dickson 2017-02-06 17:01:19 UTC
(In reply to Frank Sorenson from comment #3)
> Created attachment 1247732 [details]
> patch to replace strtok() with strtok_r() and duplicate input string

Would you mind posting this patch upstream at linux-nfs.org
To get other eye on it... Thanks!

Comment 6 Frank Sorenson 2017-02-15 13:58:01 UTC
Initial posting and Christoph's comment to use strsep rather than strtok_r:
	http://www.spinics.net/lists/linux-nfs/msg62093.html
	http://www.spinics.net/lists/linux-nfs/msg62099.html


V2 (split between fixing strtok->strsep and duplicating upcall string):

 * http://www.spinics.net/lists/linux-nfs/msg62111.html
   [nfs-utils PATCH V2 1/2] Replace non-thread-safe strtok function calls in gssd,	

 * http://www.spinics.net/lists/linux-nfs/msg62112.html
   [nfs-utils PATCH V2 2/2] Duplicate upcall string for use in error messages

Comment 7 Frank Sorenson 2017-02-20 13:53:41 UTC
The customer has tested the fixes, and reports that this fixes the rpc.gssd segfaults.

Unfortunately, they are still experiencing some kernel hangs, apparently due to multiple upcall strings being delivered in the same upcall.  Customer's message:
	I have a report from the site where they recompiled the nfs-utils package after replacing the call to strtok with the multi thread safe version.
	Even though this fixes the rpc.gssd segfaults, they still experience some kernel hangs.
	Now that the original upcall string is still complete after processing, they observe the following in the logs:
	
	1485968490 2017 fev 1 18:01:30 <nodeXXX> daemon err rpc.gssd #012handle_gssd_upcall: 'mech=krb5 uid=<uid> enctypes=<enctype1> #012mech=krb5 uid=<uid2> enctypes=<enctype2> ' (nfs/clnt0)
	
	So it would seem that several upcall strings are read at the same, when it should process only one. They also see the following:
	
	Jan 20 23:58:35 kiki2000 rpc.gssd[1432]: ERROR: Internal credentials cache error (filename: /tm/krb5ccmachine_<domain>) while initializing credential cache 'FILE:/tmp/krb5ccmachine_<domain>'
	
	
	Then this would look like a thread is consuming 2 upcall messages and some other threads get a NULL string (or something equivalent).
	
	Is the pipe reading function thread-safe ?
	
	While waiting for a complete fix, the site has rebuilt the nfs-utils package with the multi-thread patch reverted, and they say everything is just fine this way.


I am still working to understand the issue better.

Comment 8 Frank Sorenson 2017-02-20 13:55:02 UTC
Comment on attachment 1247732 [details]
patch to replace strtok() with strtok_r() and duplicate input string

patch replaced with similar one using strsep

Comment 9 Steve Dickson 2017-02-23 15:57:18 UTC
commit 0a4f5e4daccdeba767b9ef36e30efbd7fd9a76d8
Author: Frank Sorenson <sorenson>
Date:   Wed Feb 15 10:38:53 2017 -0500

    gssd:  Duplicate the upcall string for error messages

Comment 10 Steve Dickson 2017-02-27 23:22:12 UTC
commit 0a4f5e4daccdeba767b9ef36e30efbd7fd9a76d8
Author: Frank Sorenson <sorenson>
Date:   Wed Feb 15 10:38:53 2017 -0500

    gssd:  Duplicate the upcall string for error messages

commit 5ae8be8b6af1a0fdf2fa26051a05d4c04d028849
Author: Frank Sorenson <sorenson>
Date:   Wed Feb 15 10:36:47 2017 -0500

    gssd: replace non-thread-safe strtok with strsep

Comment 11 Frank Sorenson 2017-02-28 13:21:46 UTC
The customer has clarified the issue described in comment 7.  An additional patch had been applied by the customer which caused gssd to read multiple upcall messages at once.

There are no issues reported with the patches in this bugzilla.

Comment 14 ChunYu Wang 2017-03-30 06:32:01 UTC
More then 10 times of manual tests on target version cannot reproduce this problem. By comparing to test results on older versions and reviewing code, I tend to believe this problem has been resolved on target version, I will move the status to VERIFIED first, and keep an eye on this issue during future 7.4 tests.

Thanks,
ChunYu Wang

Comment 15 errata-xmlrpc 2017-08-01 19:48:51 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:2233