Bug 1419280
Summary: | Non-thread-safe functions used in multi-threaded rpc.gssd | ||
---|---|---|---|
Product: | Red Hat Enterprise Linux 7 | Reporter: | Frank Sorenson <fsorenso> |
Component: | nfs-utils | Assignee: | Steve Dickson <steved> |
Status: | CLOSED ERRATA | QA Contact: | ChunYu Wang <chunwang> |
Severity: | unspecified | Docs Contact: | |
Priority: | unspecified | ||
Version: | 7.3 | CC: | bruno.travouillon, chunwang, fsorenso, jiyin, yoyang |
Target Milestone: | rc | Keywords: | Reproducer |
Target Release: | --- | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | nfs-utils-1.3.0-0.36.el7 | Doc Type: | If docs needed, set a value |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2017-08-01 19:48:51 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Attachments: |
Description
Frank Sorenson
2017-02-04 19:23:32 UTC
Created attachment 1247730 [details]
program to stat a path as multiple euids simultaneously
Created attachment 1247731 [details]
script to launch the reproducer and check for its success
Created attachment 1247732 [details]
patch to replace strtok() with strtok_r() and duplicate input string
to use the reproducer, modify the path to stat_as and the target path in the bz1419280_test_threads script as needed, then run the script. If the bug is reproduced as a hang (only one of the possibilities, but an easily-reproducible and testable outcome), the script will report success in reproducing the issue, and then exit: # ./bz1419280_test_threads calling stat on '/mnt/test_krb5/foo' with uids 9995 through 10005 success after 1 iterations there will be 'stat_as' processes still blocked, as seen in the following ps output: 5960 pts/4 D 0:00 ./stat_as /mnt/test_krb5/foo 9995 10005 5962 pts/4 D 0:00 ./stat_as /mnt/test_krb5/foo 9995 10005 5964 pts/4 D 0:00 ./stat_as /mnt/test_krb5/foo 9995 10005 5965 pts/4 D 0:00 ./stat_as /mnt/test_krb5/foo 9995 10005 5966 pts/4 D 0:00 ./stat_as /mnt/test_krb5/foo 9995 10005 and the stack will look similar to the following: # cat /proc/5966/stack [<ffffffffa00f1beb>] gss_cred_init+0x22b/0x350 [auth_rpcgss] [<ffffffffa0324fad>] rpcauth_lookup_credcache+0x20d/0x2b0 [sunrpc] [<ffffffffa00ef403>] gss_lookup_cred+0x13/0x20 [auth_rpcgss] [<ffffffffa0325eaf>] generic_bind_cred+0x1f/0x30 [sunrpc] [<ffffffffa0325676>] rpcauth_refreshcred+0xa6/0x1c0 [sunrpc] [<ffffffffa0314d0c>] call_refresh+0x3c/0x60 [sunrpc] [<ffffffffa03215f8>] __rpc_execute+0x98/0x440 [sunrpc] [<ffffffffa0323e78>] rpc_execute+0x68/0xc0 [sunrpc] [<ffffffffa0316896>] rpc_run_task+0xf6/0x140 [sunrpc] [<ffffffffa05651f3>] nfs4_call_sync_sequence+0x63/0xa0 [nfsv4] [<ffffffffa056787d>] _nfs4_proc_access+0x11d/0x1a0 [nfsv4] [<ffffffffa05723cc>] nfs4_proc_access+0x5c/0xf0 [nfsv4] [<ffffffffa052beeb>] nfs_do_access+0x10b/0x370 [nfs] [<ffffffffa052c325>] nfs_permission+0x1a5/0x210 [nfs] [<ffffffff812333a4>] __inode_permission+0x64/0xd0 [<ffffffff81233428>] inode_permission+0x18/0x50 [<ffffffff81235aab>] link_path_walk+0x30b/0x610 [<ffffffff81235eaf>] path_lookupat+0x7f/0x110 [<ffffffff8123883f>] filename_lookup+0xaf/0x190 [<ffffffff812389f6>] user_path_at_empty+0x36/0x40 [<ffffffff8122d8e6>] vfs_fstatat+0x66/0xc0 [<ffffffff8122de3e>] SYSC_newstat+0x2e/0x60 [<ffffffff8122e11e>] SyS_newstat+0xe/0x10 [<ffffffff816ea677>] entry_SYSCALL_64_fastpath+0x1a/0xa9 [<ffffffffffffffff>] 0xffffffffffffffff less commonly, there may also be a message in the system log: Feb 3 12:03:58 vm1 rpc.gssd[15921]: WARNING: handle_gssd_upcall: failed to find uid in upcall string 'mech=krb5' (note that the reported upcall string only shows the first portion of the string, since strtok() has modified it) if the bug does not reproduce as a hang, the test script will continue attempting: # ./bz1419280_test_threads calling stat on '/mnt/test_krb5/foo' with uids 9995 through 10005 calling stat on '/mnt/test_krb5/foo' with uids 9995 through 10005 calling stat on '/mnt/test_krb5/foo' with uids 9995 through 10005 calling stat on '/mnt/test_krb5/foo' with uids 9995 through 10005 calling stat on '/mnt/test_krb5/foo' with uids 9995 through 10005 calling stat on '/mnt/test_krb5/foo' with uids 9995 through 10005 calling stat on '/mnt/test_krb5/foo' with uids 9995 through 10005 calling stat on '/mnt/test_krb5/foo' with uids 9995 through 10005 calling stat on '/mnt/test_krb5/foo' with uids 9995 through 10005 calling stat on '/mnt/test_krb5/foo' with uids 9995 through 10005 calling stat on '/mnt/test_krb5/foo' with uids 9995 through 10005 calling stat on '/mnt/test_krb5/foo' with uids 9995 through 10005 calling stat on '/mnt/test_krb5/foo' with uids 9995 through 10005 calling stat on '/mnt/test_krb5/foo' with uids 9995 through 10005 calling stat on '/mnt/test_krb5/foo' with uids 9995 through 10005 ... (In reply to Frank Sorenson from comment #3) > Created attachment 1247732 [details] > patch to replace strtok() with strtok_r() and duplicate input string Would you mind posting this patch upstream at linux-nfs.org To get other eye on it... Thanks! Initial posting and Christoph's comment to use strsep rather than strtok_r: http://www.spinics.net/lists/linux-nfs/msg62093.html http://www.spinics.net/lists/linux-nfs/msg62099.html V2 (split between fixing strtok->strsep and duplicating upcall string): * http://www.spinics.net/lists/linux-nfs/msg62111.html [nfs-utils PATCH V2 1/2] Replace non-thread-safe strtok function calls in gssd, * http://www.spinics.net/lists/linux-nfs/msg62112.html [nfs-utils PATCH V2 2/2] Duplicate upcall string for use in error messages The customer has tested the fixes, and reports that this fixes the rpc.gssd segfaults. Unfortunately, they are still experiencing some kernel hangs, apparently due to multiple upcall strings being delivered in the same upcall. Customer's message: I have a report from the site where they recompiled the nfs-utils package after replacing the call to strtok with the multi thread safe version. Even though this fixes the rpc.gssd segfaults, they still experience some kernel hangs. Now that the original upcall string is still complete after processing, they observe the following in the logs: 1485968490 2017 fev 1 18:01:30 <nodeXXX> daemon err rpc.gssd #012handle_gssd_upcall: 'mech=krb5 uid=<uid> enctypes=<enctype1> #012mech=krb5 uid=<uid2> enctypes=<enctype2> ' (nfs/clnt0) So it would seem that several upcall strings are read at the same, when it should process only one. They also see the following: Jan 20 23:58:35 kiki2000 rpc.gssd[1432]: ERROR: Internal credentials cache error (filename: /tm/krb5ccmachine_<domain>) while initializing credential cache 'FILE:/tmp/krb5ccmachine_<domain>' Then this would look like a thread is consuming 2 upcall messages and some other threads get a NULL string (or something equivalent). Is the pipe reading function thread-safe ? While waiting for a complete fix, the site has rebuilt the nfs-utils package with the multi-thread patch reverted, and they say everything is just fine this way. I am still working to understand the issue better. Comment on attachment 1247732 [details]
patch to replace strtok() with strtok_r() and duplicate input string
patch replaced with similar one using strsep
commit 0a4f5e4daccdeba767b9ef36e30efbd7fd9a76d8 Author: Frank Sorenson <sorenson> Date: Wed Feb 15 10:38:53 2017 -0500 gssd: Duplicate the upcall string for error messages commit 0a4f5e4daccdeba767b9ef36e30efbd7fd9a76d8 Author: Frank Sorenson <sorenson> Date: Wed Feb 15 10:38:53 2017 -0500 gssd: Duplicate the upcall string for error messages commit 5ae8be8b6af1a0fdf2fa26051a05d4c04d028849 Author: Frank Sorenson <sorenson> Date: Wed Feb 15 10:36:47 2017 -0500 gssd: replace non-thread-safe strtok with strsep The customer has clarified the issue described in comment 7. An additional patch had been applied by the customer which caused gssd to read multiple upcall messages at once. There are no issues reported with the patches in this bugzilla. More then 10 times of manual tests on target version cannot reproduce this problem. By comparing to test results on older versions and reviewing code, I tend to believe this problem has been resolved on target version, I will move the status to VERIFIED first, and keep an eye on this issue during future 7.4 tests. Thanks, ChunYu Wang Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2017:2233 |