Bug 1325019 - rpc.gssd uses 100% CPU and lots of I/O when Kerberos ticket expires. [NEEDINFO]
Summary: rpc.gssd uses 100% CPU and lots of I/O when Kerberos ticket expires.
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: nfs-utils
Version: 6.9
Hardware: All
OS: Linux
Priority: unspecified
Severity: medium
Target Milestone: rc
Assignee: Steve Dickson
QA Contact: Filesystem QE
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-04-07 21:12 UTC by Ender
Modified: 2017-12-06 11:51 UTC
CC List: 3 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-12-06 11:51:07 UTC
Target Upstream Version:
steved: needinfo? (ender)


Attachments
Close descriptor when return value is POLLERR. (931 bytes, patch)
2016-04-07 21:12 UTC, Ender

Description Ender 2016-04-07 21:12:26 UTC
Created attachment 1144898
Close descriptor when return value is POLLERR.

Description of problem:

Under some circumstances, rpc.gssd spikes to 100% CPU when a Kerberos context disappears from disk, due to a corner case in how the revents flags returned by poll() are handled internally.

The internet is full of reports from people hitting this issue, admittedly several years old, and it seems it is no longer a problem after the big refactor that took out the inotify and poll logic.  Still, it's a problem for us because we use RHEL and CentOS 6.

After spending some quality time with gdb, I found that rpc.gssd hits a corner case when the clntXXX/gssd named pipe is deleted while rpc.gssd still holds a descriptor to it (but the containing directory is still there). In that situation poll() in the main loop returns POLLERR|POLLHUP in revents, but scan_poll_results() has no handling for POLLERR other than re-reading all the contents of /var/lib/nfs/rpc_pipefs/ looking for changes. Sadly, the containing directory still exists (just empty), so the deletion logic is never triggered, the stale descriptor stays in the poll set, and the server spins at 100% CPU reading those directories over and over again (poll()'ing in the meantime).


Version-Release number of selected component (if applicable):

All of them, up to the latest release (1.2.3-64).


How reproducible:
So far this has happened to us only with mosh/screen/tmux sessions, so I suspect there is something in these that triggers the behaviour (probably the user is still there but the ticket has expired).  It happens when the directory clntXXX is there and rpc.gssd has an fd pointing to the corresponding gssd named pipe (see fd 24):

[...]
lrwx------. 1 root root 64 Mar 25 10:26 2 -> /dev/null
lr-x------. 1 root root 64 Mar 25 10:26 20 -> /var/lib/nfs/rpc_pipefs/nfs/clnt14e1
lr-x------. 1 root root 64 Mar 25 10:26 21 -> /var/lib/nfs/rpc_pipefs/nfs/clnt173b
lr-x------. 1 root root 64 Mar 25 10:26 22 -> /var/lib/nfs/rpc_pipefs/nfs/clnt1aea
lr-x------. 1 root root 64 Mar 25 10:26 23 -> /var/lib/nfs/rpc_pipefs/nfs/clnt18c4
lrwx------. 1 root root 64 Mar 25 10:26 24 -> /var/lib/nfs/rpc_pipefs/nfs/clnt14e1
lr-x------. 1 root root 64 Mar 25 10:26 25 -> /var/lib/nfs/rpc_pipefs/gssd/clntXX
lrwx------. 1 root root 64 Mar 25 10:26 26 -> /var/lib/nfs/rpc_pipefs/gssd/clntXX/gssd
lrwx------. 1 root root 64 Mar 25 10:26 27 -> /var/lib/nfs/rpc_pipefs/nfs/clnt14e1/gssd (deleted)
lr-x------. 1 root root 64 Mar 28 11:42 28 -> /var/lib/nfs/rpc_pipefs/nfs/clnt1aec
[...]

To make sure, I ran a "memset(&pollarray[i], 0, sizeof(struct pollfd))" under gdb and watched rpc.gssd return to normal operation (strace showed everything fine; ls -l /proc/PID/fd didn't show anything odd).

The attached patch applies to nfs-utils-1.2.3-64.el6.  I haven't seen a single occurrence of this bug since I patched our internal binary.

Comment 2 Steve Dickson 2016-08-24 19:39:10 UTC
Could you please post the proposed patch to the NFS upstream at
    linux-nfs.org

Using the patch format described in 
   https://www.kernel.org/doc/Documentation/SubmittingPatches

esp. the Signed-off-by, subject line, and description 

tia!

Comment 3 Ender 2016-09-05 17:26:05 UTC
Sorry, I missed your note, Steve.  I'll do so as soon as I have a moment.  Thanks!

Comment 4 Jan Kurik 2017-12-06 11:51:07 UTC
Red Hat Enterprise Linux 6 is in the Production 3 Phase. During the Production 3 Phase, Critical impact Security Advisories (RHSAs) and selected Urgent Priority Bug Fix Advisories (RHBAs) may be released as they become available.

The official life cycle policy can be reviewed here:

http://redhat.com/rhel/lifecycle

This issue does not meet the inclusion criteria for the Production 3 Phase and will be marked as CLOSED/WONTFIX. If this remains a critical requirement, please contact Red Hat Customer Support to request a re-evaluation of the issue, citing a clear business justification. Note that a strong business justification will be required for re-evaluation. Red Hat Customer Support can be contacted via the Red Hat Customer Portal at the following URL:

https://access.redhat.com/

