Bug 1466944 - Odd NFS client hangs that go away after restarting gssproxy
Status: CLOSED ERRATA
Alias: None
Product: Fedora
Classification: Fedora
Component: gssproxy
Version: 25
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
Assignee: Robbie Harwood
QA Contact: Fedora Extras Quality Assurance
 
Reported: 2017-06-30 20:18 UTC by Jason Tibbitts
Modified: 2017-11-11 03:07 UTC
CC: 11 users

Fixed In Version: gssproxy-0.7.0-22.fc27
Last Closed: 2017-11-11 03:07:52 UTC
Type: Bug


Attachments
Add timeout harder patch (945 bytes, patch)
2017-08-31 17:47 UTC, Simo Sorce
Very WIP patch to allow some client side debugging (4.68 KB, patch)
2017-08-31 18:57 UTC, Simo Sorce

Description Jason Tibbitts 2017-06-30 20:18:22 UTC
This is an odd problem, so please bear with me as I try to be coherent.

We have a medium-sized network with a bunch of desktops and user-accessible computational servers, with several file servers containing user storage.  (Most users will access more than one file server.)  Everything is shared via NFS4, normally krb5p but I have downgraded to plain "krb5" to try and debug some issues.  The clients are all Fedora 25; all are fully updated (so gssproxy-0.7.0-9.fc25.x86_64) and most have been booted into 4.11.6-201.fc25.x86_64.

We run a little monitoring service which extracts data from the clients every so often.  One of the things it does is run 'df'.  It's a socket-activated service and runs as 'nobody'.  Our user desktops also have a little applet which runs 'quota' every so often and yells at users when they get close to their disk quotas.  We uniformly run KDE Plasma.  (This is actually relevant.)

Today I needed to restart a file server.  Generally this is no big deal; any users here at 6:30 would see a pause of a couple of minutes but things have always recovered well enough.  Unfortunately that wasn't the case today.

Instead of recovering, the machines started accumulating 'df' processes in the 'D' state, started by the monitoring system.  And afterwards, the periodic quota command would hang, as would (most of the time) anything else accessing the user's home directory.  Killing the 'df' processes would immediately unblock the other processes, but they would come back eventually.  When killing them, the client would log a few messages like:

[182807.488098] NFS: nfs4_reclaim_open_state: Lock reclaim failed!

And the server would log:

[ 9046.812818] nfsd4_validate_stateid: 69 callbacks suppressed
[ 9046.818438] NFSD: client 172.21.86.221 testing state ID with incorrect client ID
(with the latter repeated a few times).
 
It turned out that any user with a principal, and root (which I guess uses the machine principal), can run df just fine.  'nobody' or any user without a principal can't; it just hangs.  The same goes for basically any access to any NFS at all.  And once one process hangs, all other NFS access hangs too.
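
For anyone wanting to spot these on a client, a generic way to list the stuck processes and what they are waiting on is something like this (standard procps usage, nothing specific to this report):

# list uninterruptible (D-state) processes and their kernel wait channel
ps -eo pid,stat,user,wchan:32,args | awk 'NR==1 || $2 ~ /^D/'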

If I restart gssproxy, everything immediately unblocks and df starts working for everyone.  On some clients, the following was logged immediately when I did the restart:

[81691.028428] rpc.gssd[2398]: segfault at 0 ip 00005621102ccdae sp 00007fba9b7fd7c0 error 4 in rpc.gssd[5621102c6000+14000]
[81691.029195] NFS: state manager: reset session failed on NFSv4 server web1 with error 13

So far things appear to be OK on all of these clients after the gssproxy restart.

I do have gssproxy running with debugging enabled (level 3) on one of these machines, but that machine is also an NFS server (it's my personal desktop) and so I'm not entirely sure if those logs are useful.  They also contain a whole lot of data per line.  But if you tell me what you might like to see, I can try to get it for you.  I will also try to set up some VMs to try and reproduce this, but that will have to wait until next week.
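
(For reference, the debug level here is set in /etc/gssproxy/gssproxy.conf and the daemon restarted; roughly the following, assuming the stock config layout:)

[gssproxy]
  debug_level = 3

# then:  systemctl restart gssproxy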

Comment 1 Robbie Harwood 2017-06-30 20:37:25 UTC
My unsubstantiated guess is that NFS is leaning on gssproxy here in ways it shouldn't.  In particular, the segfault is very concerning to me.  Moving this to NFS, if only to get a better idea of what to look at.

Comment 2 Steve Dickson 2017-07-03 14:00:10 UTC
(In reply to Robbie Harwood from comment #1)
> My unsubstantiated guess is that NFS is leaning on gssproxy here in ways it
> shouldn't.  In particular, the segfault is very concerning to me.  Moving
> this to NFS, if only to get a better idea of what to look at.

Could you please be a bit more explicit about what "leaning on gssproxy here" means?

Also, is a core being generated from the segfault?

Comment 3 Jason Tibbitts 2017-07-03 17:41:41 UTC
coredumpctl doesn't have anything related to the rpc.gssd segfault.  And note that I actually already had a file /etc/systemd/system/rpc-gssd.service.d/coredump.conf containing:

[Service]
LimitCORE=infinity

which I understand should have enabled coredumps for rpc.gssd.  That's been in there for months, so I don't know why I didn't get a core or anything useful at all.
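
For what it's worth, the way I'd expect to verify that limit and to check for captured cores is roughly the following (just standard systemd/coredumpctl usage, nothing specific to this setup):

systemctl daemon-reload && systemctl restart rpc-gssd
grep -i core /proc/$(pidof rpc.gssd)/limits    # confirm the running daemon has an unlimited core limit
coredumpctl list rpc.gssd                      # any cores caught by systemd-coredump?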

As for "leaning on gssproxy", I know that question wasn't directed at me but I can say that I do not recall having problems of this nature until gssproxy started being involved in client NFS.  I recall that being a relatively recent thing, though it's been involved in server NFS for some time.

Comment 4 Jason Tibbitts 2017-07-03 19:52:52 UTC
For some additional info, I noticed that a different client had some mount processes stuck in the D state after this.  I killed the mount processes and the following appeared in the kernel log:

[Jul 3 14:50] NFS: nfs4_discover_server_trunking unhandled error -512. Exiting with error EIO
[  +0.000106] NFS: nfs4_discover_server_trunking unhandled error -512. Exiting with error EIO
[  +0.109617] BUG: unable to handle kernel paging request at ffffbbce4926b938
[  +0.000027] IP: memcpy_erms+0x6/0x10
[  +0.000010] PGD 40f09a067 
[  +0.000001] PUD 40f09b067 
[  +0.000007] PMD 404287067 
[  +0.000009] PTE 0

[  +0.000020] Oops: 0000 [#1] SMP
[  +0.000009] Modules linked in: fuse ses enclosure scsi_transport_sas uas usb_storage rpcsec_gss_krb5 nfsv4 dns_resolver nfs lockd grace fscache nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack snd_hda_codec_realtek snd_hda_codec_hdmi snd_hda_codec_generic vfat intel_rapl fat x86_pkg_temp_thermal intel_powerclamp coretemp kvm snd_hda_intel snd_hda_codec irqbypass crct10dif_pclmul snd_hda_core crc32_pclmul snd_hwdep snd_seq ghash_clmulni_intel eeepc_wmi asus_wmi iTCO_wdt snd_seq_device sparse_keymap iTCO_vendor_support mxm_wmi rfkill snd_pcm e1000e intel_cstate intel_uncore snd_timer intel_rapl_perf snd ptp i2c_i801 mei_me lpc_ich mei shpchp soundcore pps_core tpm_infineon wmi tpm_tis tpm_tis_core acpi_pad tpm auth_rpcgss sunrpc binfmt_misc xfs libcrc32c i915 i2c_algo_bit drm_kms_helper drm
[  +0.000180]  crc32c_intel serio_raw video
[  +0.000013] CPU: 0 PID: 4062 Comm: kworker/0:2 Not tainted 4.11.6-201.fc25.x86_64 #1
[  +0.000019] Hardware name: ASUS All Series/H97I-PLUS, BIOS 2704 02/18/2016
[  +0.000027] Workqueue: rpciod rpc_async_schedule [sunrpc]
[  +0.000014] task: ffff98c364964b00 task.stack: ffffbbce4a564000
[  +0.000016] RIP: 0010:memcpy_erms+0x6/0x10
[  +0.000011] RSP: 0018:ffffbbce4a567978 EFLAGS: 00010282
[  +0.000013] RAX: ffff98c231a1c084 RBX: 0000000000000008 RCX: 0000000000000008
[  +0.000018] RDX: 0000000000000008 RSI: ffffbbce4926b938 RDI: ffff98c231a1c084
[  +0.000017] RBP: ffffbbce4a5679a0 R08: ffff98c231a1cfd4 R09: ffff98c3582b8410
[  +0.000018] R10: ffffbbce4a567c90 R11: ffff98c389a49800 R12: 0000000000000000
[  +0.000017] R13: 0000000000000002 R14: ffff98c231a1c084 R15: ffff98c193943a08
[  +0.000018] FS:  0000000000000000(0000) GS:ffff98c39fa00000(0000) knlGS:0000000000000000
[  +0.000019] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  +0.000015] CR2: ffffbbce4926b938 CR3: 000000031be09000 CR4: 00000000001406f0
[  +0.000017] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  +0.000017] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  +0.000017] Call Trace:
[  +0.000015]  ? xdr_encode_opaque_fixed+0x37/0x70 [sunrpc]
[  +0.000025]  encode_opaque_fixed+0x2c/0x60 [nfsv4]
[  +0.000017]  encode_exchange_id+0x56/0x1d0 [nfsv4]
[  +0.000014]  ? sha1_generic_block_fn+0x64/0x90
[  +0.000012]  ? sha1_final+0x7e/0x150
[  +0.000010]  ? crypto_sha1_finup+0x5d/0x160
[  +0.000017]  ? xdr_reserve_space+0x1c/0x140 [sunrpc]
[  +0.000018]  ? xdr_encode_opaque+0x19/0x20 [sunrpc]
[  +0.000017]  ? xdr_reserve_space+0x1c/0x140 [sunrpc]
[  +0.000017]  nfs4_xdr_enc_exchange_id+0x7e/0xb0 [nfsv4]
[  +0.000018]  ? encode_exchange_id+0x1d0/0x1d0 [nfsv4]
[  +0.000014]  gss_wrap_req_encode+0x44/0x70 [auth_rpcgss]
[  +0.000015]  ? make_checksum_v2+0x2bc/0x300 [rpcsec_gss_krb5]
[  +0.000016]  gss_wrap_req_integ.isra.7+0x69/0x140 [auth_rpcgss]
[  +0.000015]  ? gss_get_mic_kerberos+0x123/0x340 [rpcsec_gss_krb5]
[  +0.000016]  ? gss_get_mic_kerberos+0x2ea/0x340 [rpcsec_gss_krb5]
[  +0.000016]  gss_wrap_req+0x136/0x1b0 [auth_rpcgss]
[  +0.000013]  ? gss_wrap_req+0x136/0x1b0 [auth_rpcgss]
[  +0.000018]  ? xdr_encode_opaque+0x19/0x20 [sunrpc]
[  +0.000013]  ? gss_marshal+0x163/0x220 [auth_rpcgss]
[  +0.000017]  ? encode_exchange_id+0x1d0/0x1d0 [nfsv4]
[  +0.000026]  rpcauth_wrap_req+0x63/0xe0 [sunrpc]
[  +0.000015]  call_transmit+0x199/0x2a0 [sunrpc]
[  +0.000015]  ? call_decode+0x810/0x810 [sunrpc]
[  +0.000014]  ? call_decode+0x810/0x810 [sunrpc]
[  +0.000015]  __rpc_execute+0x8d/0x420 [sunrpc]
[  +0.000014]  rpc_async_schedule+0x12/0x20 [sunrpc]
[  +0.000013]  process_one_work+0x197/0x450
[  +0.000010]  worker_thread+0x4e/0x4a0
[  +0.000010]  kthread+0x109/0x140
[  +0.000009]  ? process_one_work+0x450/0x450
[  +0.000010]  ? kthread_park+0x90/0x90
[  +0.000011]  ret_from_fork+0x2c/0x40
[  +0.000009] Code: 90 90 90 90 90 eb 1e 0f 1f 00 48 89 f8 48 89 d1 48 c1 e9 03 83 e2 07 f3 48 a5 89 d1 f3 a4 c3 66 0f 1f 44 00 00 48 89 f8 48 89 d1 <f3> a4 c3 0f 1f 80 00 00 00 00 48 89 f8 48 83 fa 20 72 7e 40 38
[  +0.000784] RIP: memcpy_erms+0x6/0x10 RSP: ffffbbce4a567978
[  +0.000740] CR2: ffffbbce4926b938
[  +0.005328] ---[ end trace afb1917ca63f57ae ]---

Comment 5 Steve Dickson 2017-07-04 14:59:41 UTC
I have found that changing GSS_USE_PROXY to "no" in /etc/sysconfig/nfs
and then rebooting seems to stop the hanging mount.

Does this change fix the hang you are seeing?

Comment 6 Jason Tibbitts 2017-07-05 17:02:49 UTC
I'm assuming you're talking about doing this client-side.  I guess that would disable the use of gssproxy for client NFS, which should have pretty much the same effect as masking gssproxy, which is something I've already done on some hosts.

Of course rebooting resets all the state on the client anyway, so after a reboot I would have to try and reproduce that state.  And currently I don't have a good reproducer which isn't disruptive to my network....   I will hopefully find some time to work on that this week.

Comment 7 Steve Dickson 2017-07-06 11:14:10 UTC
(In reply to Jason Tibbitts from comment #6)
> I'm assuming you're talking about doing this client-side.  I guess that
> would disable the use of gssproxy for client NFS, which should have pretty
> much the same effect as masking gssproxy, which is something I've already
> done on some hosts.
Right, on the client side... and yes, not starting gssproxy is
the same thing as not using it. ;-)

> 
> Of course rebooting resets all the state on the client anyway, so after a
> reboot I would have to try and reproduce that state.  And currently I don't
> have a good reproducer which isn't disruptive to my network....   I will
> hopefully find some time to work on that this week.

Simply restarting rpc.gssd should be good enough after changing the
environment variable.
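
Roughly, that amounts to the following (a sketch only; it assumes the stock /etc/sysconfig/nfs already has a GSS_USE_PROXY line, and the unit names are the Fedora/RHEL defaults):

# on the client, as root
sed -i 's/^GSS_USE_PROXY=.*/GSS_USE_PROXY="no"/' /etc/sysconfig/nfs
systemctl restart nfs-config rpc-gssd   # nfs-config regenerates the environment file rpc-gssd reads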

Comment 8 Steve Dickson 2017-07-06 11:16:20 UTC
Since enabling gssproxy seems to cause this hang, I'm
going to hand this back over to the gssproxy team 
to have them take a closer look.

Comment 9 Robbie Harwood 2017-07-06 15:01:57 UTC
Core please.

Comment 10 Jason Tibbitts 2017-07-06 16:13:57 UTC
I guess you're asking me for a core file.  But there's no core from anything.  The only thing that segfaults is rpc.gssd, and then only when the system is in this odd state and I restart gssproxy.  And then not all the time.  gssproxy itself never crashes.

Even when systemd is configured to not limit coredumpsize when spawning rpc.gssd, I still don't get a core:

ἐπιθυμία:~❯ s systemctl show rpc-gssd|grep CORE
LimitCORE=18446744073709551615
LimitCORESoft=18446744073709551615

I don't know what else I can do to get one.

As I mentioned before, I did have gssproxy running at debug level 3 on one of the affected hosts.  But I don't know just what to send; the machine where debugging was enabled is both a client and a server, and it logs a lot of stuff.  I can attach the whole log (300K) from that morning until I restarted gssproxy if that would help, but I'm not sure if it exposes anything I shouldn't be exposing.  (Some of the logged lines are long enough to include machine principals, for example.)

Comment 11 Jason Tibbitts 2017-07-06 22:30:23 UTC
So I've tried to reproduce this by setting up an EL7 VM with approximately the same configuration; the only non-default NFS-related things I use are a Domain= line in idmapd.conf and an RPCNFSDCOUNT=32 in /etc/sysconfig/nfs.

And with the same clients (now all running 4.11.8) I just can't make this or anything like this happen.  I tried to reproduce by shutting down the VM and leaving it down for the approximate time it takes one of the real servers to reboot, and by varying the time up to tens of minutes.  And, sure, things hang around for a while trying to contact the server, but those 'df' commands from the monitoring system will move on after a few minutes and everything immediately starts working when the machine comes back up.

The only odd behavior is that after the server has come back, accesses to the mounted directory will sometimes give permission denied errors for a bit.  That appears to recover eventually, though I haven't figured out whether I need to do anything specific to make it happen.

So the only part of my server environment that I might not be managing to reproduce is the fact that the real servers have multiple IP addresses, and for historical reasons the "primary" one isn't on the same network as most of the clients.  But I'm having trouble thinking of ways that could possibly matter.

Comment 12 Robbie Harwood 2017-07-11 18:46:28 UTC
Okay well... we're probably not going to be able to fix the segfault without a coredump, or at least a backtrace.

Logs will depend on what you feel comfortable exposing.  If you're worried about names and such, then definitely don't include them; actual principals/keys themselves shouldn't be passed back and forth to the daemon in normal usage.

Mostly what I'd like to know is what the conversation with rpc.gssd looked like leading up to the hang, which I can mostly get out of the log files (if debug level is high enough).

If you're not comfortable giving logs, a reproducer would work as well (e.g., Vagrantfile).

Thanks!

Comment 13 Jason Tibbitts 2017-07-11 19:00:23 UTC
I had to reboot a server yesterday and indeed the problem is still there; two clients which had an active mount from that server went into the same bizarre state with df hanging, and a gssproxy restart was the only thing required to get things working properly again.  Of course I simply can't make that happen without actually rebooting one of my real in-production machines.  Doing the same with an export from a VM doesn't show any issues at all.

I'll attach a gssproxy log at debug level 3 from my desktop starting from the time I logged in that morning until I restarted gssproxy to get things working again.  The rpc.gssd segfault happened at 11:37, just before gssproxy restarted.

The machine which produced this log is both an NFS client and a server, so it's going to be a bit verbose.

Comment 14 Jason Tibbitts 2017-07-11 19:00:53 UTC
Created attachment 1296409 [details]
gssproxy log

Comment 15 Robbie Harwood 2017-07-11 22:44:07 UTC
There's nothing in here that indicates to me why the NFS server gets upset.  The logs indicate that all GSS calls receive responses.

11:16:17 is the last time I see nfs-client try to do anything... it tries to acquire creds and gets back a failure about missing ccache.  Prior to that, it looks to have mostly been getting a mixture of expiration messages and the occasional missing ccache as well.

What is pid 1071 there, do you know?  Is this just journalctl being strange?  (I don't know how its logging works, whether it's logging pids or tids or what.)

Comment 16 Benjamin Coddington 2017-07-12 18:27:25 UTC
(In reply to Jason Tibbitts from comment #11)
> So the only part of my server environment that I might not be managing to
> reproduce is the fact that the real servers have multiple IP addresses, and
> for historical reasons the "primary" one isn't on the same network as most
> of the clients.  But I'm having trouble thinking of ways that could possibly
> matter.

Hi Jason, what do you mean by "primary" address, and what do you have for forward and reverse DNS records for the server's IP addresses?  I wonder if this difference is important to reproducing the problem.

Comment 17 Jason Tibbitts 2017-07-12 19:30:35 UTC
Yes, all of the DNS is correct and maps to the relevant names properly.  But I realize that the hostname is reflected in the Kerberos principal and so it could somehow be related.  So I'll go into detail:

Take a machine, say, nas04.  It has one active interface with two IPs:

2: enp6s0f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP qlen 1000
    link/ether 00:1b:2b:1c:0d:92 brd ff:ff:ff:ff:ff:ff
    inet 129.7.128.45/24 brd 129.7.128.255 scope global enp6s0f0
       valid_lft forever preferred_lft forever
    inet 172.21.86.68/23 brd 172.21.87.255 scope global enp6s0f0:0
       valid_lft forever preferred_lft forever

nas04.math.uh.edu has address 129.7.128.45
nas04.e.math.uh.edu has address 172.21.86.68
45.128.7.129.in-addr.arpa domain name pointer nas04.math.uh.edu.
68.86.21.172.in-addr.arpa domain name pointer nas04.e.math.uh.edu.

Automount maps just use the unqualified hostnames and so the clients get the right hostname for whichever network they're on.  When a client is accessing a file system, this is what shows up in /proc/mounts:

nas04:/export/storage-06/tibbs /storage/tibbs nfs4 rw,seclabel,relatime,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,port=0,timeo=600,retrans=2,sec=krb5p,clientaddr=172.21.86.160,local_lock=none,addr=172.21.86.68 0 0

That server has four machine principals:

[root@nas04 ~]# klist -k /etc/krb5.keytab | uniq
Keytab name: FILE:/etc/krb5.keytab
KVNO Principal
---- --------------------------------------------------------------------------
   4 host/nas04.math.uh.edu.EDU
   3 host/nas04.e.math.uh.edu.EDU
   3 nfs/nas04.math.uh.edu.EDU
   3 nfs/nas04.e.math.uh.edu.EDU

The clients just have one principal.  For example:

[root@ld60 ~]# klist -k /etc/krb5.keytab | uniq
Keytab name: FILE:/etc/krb5.keytab
KVNO Principal
---- --------------------------------------------------------------------------
   7 host/ld60.e.math.uh.edu.EDU

Of course all of this works pretty well most of the time.  One obvious exception is when the server reboots.

I will try to reproduce this setup in a VM and see if any of it makes any difference.

Comment 18 Jason Tibbitts 2017-07-12 20:38:17 UTC
So I set up a VM with the same configuration and I think I managed to reproduce... something.  I cd'd to the mounted path on a client (so technically the automounter mounted it) and ran 'date > d' once per second, then took the server VM down.  When I just rebooted it, so it was only down for a few seconds, I got three of these in dmesg:

[Jul12 15:04] NFS: state manager: check lease failed on NFSv4 server test-el7 with error 13

but the client recovered.  There was a period where I got this:

ls: cannot open directory '.': Permission denied

(which itself is a bit weird and probably a separate bug).  But then shortly thereafter everything started working.

So I took the server VM down for a longer period of time, because rebooting real server hardware often takes quite a while.  Of course someone came to my office in the middle of this so it was off for a few minutes:

[Jul12 15:08] nfs: server test-el7 not responding, still trying
[Jul12 15:09] nfs: server test-el7 not responding, timed out
[Jul12 15:12] nfs: server test-el7 not responding, timed out
[  +9.151438] nfs: server test-el7 not responding, timed out
[Jul12 15:15] nfs: server test-el7 OK

But currently things are in a bad state; the shell loop is hung and df as root and as a user with a principal is hung as well.  df when run by nobody is working fine, though, for whatever reason.

I restarted gssproxy and immediately everything is going again.  I will do some more experimentation and attach some logs.
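
For reference, the client-side load in this test was nothing fancier than a loop like the following (the path is just the automounted storage directory from my setup):

cd /storage/tibbs                       # automounted krb5 NFS path
while true; do date > d; sleep 1; done  # one small write per second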

Comment 19 Jason Tibbitts 2017-08-04 18:58:02 UTC
For the record, I filed that weird permission problem (which actually leads to data loss) as bug 1470467.  But I suspect my various NFSv4/krb5 problems are all interrelated.

Comment 20 Jason Tibbitts 2017-08-22 21:02:58 UTC
Created attachment 1316812 [details]
log with debugging patch over a hang and restart

Since I filed this ticket, I have updated all of my machines to Fedora 26, which as expected didn't really change anything.

I'm currently running the rawhide version of gssproxy built for F26 on all of my machines and have debugging turned up.  Within a few minutes of pushing that out, I had another machine in the hanging state.  I have attached the log from the moment gssproxy was started with debug level 3 at 15:36.  Note that at least once per minute a process is firing up to run 'df', and those just hang; gssproxy logs nothing at this point and probably isn't even being called.

At 15:51 I think I tried to log into the machine as myself, which generated some more activity as it attempted to mount my home directory (which hangs).

At 15:53 I logged in as root and restarted gssproxy.  All of the hanging df commands finish at this point.

Comment 21 Simo Sorce 2017-08-22 21:17:12 UTC
Did you restart all rpc.gssd processes when you updated gssproxy on your clients?

Comment 22 Jason Tibbitts 2017-08-22 22:01:51 UTC
I didn't, but I can just go and do that now and then let you know if anything else happens.

It would be good to know what I should look for in the logs to see if something has gone wrong.  If the timeout patch manages to turn the hang into a delay then it might look to me as if things are fixed while the underlying problem continues.

Comment 23 Orion Poplawski 2017-08-29 20:55:08 UTC
FWIW - Just came across a mount hanging for a particular user.  Restarting gssproxy cleared the hang.  This is on Scientific Linux 7 with:

gssproxy-0.7.0-4.el7.x86_64
kernel 3.10.0-693.1.1.el7.x86_64

I also seem to be seeing mount commands getting somewhat stuck more these days on my workstation - but those can be killed.

Comment 24 Jason Tibbitts 2017-08-31 03:03:09 UTC
I talked with Simo at length here at Flock, so I have a better understanding of how the various pieces interact.  But in the end we're still missing a backtrace of rpc.gssd while hung processes are accumulating.  Right now, since Houston is under water, there's nobody in the office to cause problems, but I just found one machine that started accumulating df processes in the 'D' state.  Note that neither rpc.gssd nor gssproxy are hung up in any way.  (ps shows them in the 'Ssl' state.)

But here's the weird thing: the moment I disconnect gdb from the running rpc.gssd process, the hung processes immediately unblock.  I guess doing that sends some signal or otherwise kicks the process and gets it going again.  Which I guess implies that the problem (or what remains of it) is not in gssproxy.
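
(For the record, the backtrace below was grabbed by attaching to the running daemon, roughly like this; standard gdb usage with the gssproxy and nfs-utils debuginfo installed:)

gdb -p $(pidof rpc.gssd)
(gdb) thread apply all bt
(gdb) detach
(gdb) quit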

Thread 2 (Thread 0x7f3354306700 (LWP 17146)):
#0  0x00007f3355b39123 in epoll_wait () from /lib64/libc.so.6
#1  0x00007f3354b50433 in gpm_epoll_wait (gpmctx=0x7f3354d5c1a0 <gpm_global_ctx>, event_flags=<optimized out>) at src/client/gpm_common.c:283
#2  0x00007f3354b50bdd in gpm_recv_buffer (length=<optimized out>, buffer=<optimized out>, gpmctx=<optimized out>) at src/client/gpm_common.c:425
#3  gpm_send_recv_loop (recv_length=<synthetic pointer>, recv_buffer=0x7f3354305558, send_length=116, send_buffer=<optimized out>, gpmctx=0x7f3354d5c1a0 <gpm_global_ctx>) at src/client/gpm_common.c:550
#4  gpm_make_call (proc=proc@entry=6, arg=arg@entry=0x7f3354305720, res=res@entry=0x7f3354305680) at src/client/gpm_common.c:724
#5  0x00007f3354b4cbff in gpm_acquire_cred (minor_status=minor_status@entry=0x7f3354305948, in_cred_handle=in_cred_handle@entry=0x0, desired_name=<optimized out>, time_req=time_req@entry=4294967295, 
    desired_mechs=desired_mechs@entry=0x7f33480195a0, cred_usage=cred_usage@entry=1, impersonate=false, output_cred_handle=0x7f33480032f0, actual_mechs=0x0, time_rec=0x0) at src/client/gpm_acquire_cred.c:106
#6  0x00007f3354b51a05 in gssi_acquire_cred_from (minor_status=0x7f3354305b04, desired_name=0x0, time_req=4294967295, desired_mechs=0x7f33480195a0, cred_usage=1, cred_store=0x0, output_cred_handle=0x7f3354305a10, actual_mechs=0x0, 
    time_rec=0x0) at src/mechglue/gpp_acquire_cred.c:165
#7  0x00007f3356976030 in gss_add_cred_from (minor_status=minor_status@entry=0x7f3354305b04, input_cred_handle=0x7f3348002190, desired_name=desired_name@entry=0x0, desired_mech=<optimized out>, cred_usage=cred_usage@entry=1, 
    initiator_time_req=initiator_time_req@entry=4294967295, acceptor_time_req=4294967295, cred_store=0x0, output_cred_handle=0x0, actual_mechs=0x0, initiator_time_rec=0x0, acceptor_time_rec=0x0) at g_acquire_cred.c:455
#8  0x00007f33569765be in gss_acquire_cred_from (minor_status=minor_status@entry=0x7f3354305bfc, desired_name=desired_name@entry=0x0, time_req=time_req@entry=4294967295, desired_mechs=desired_mechs@entry=0x7f3354305c00, 
    cred_usage=cred_usage@entry=1, cred_store=cred_store@entry=0x0, output_cred_handle=0x7f3354305c90, actual_mechs=0x0, time_rec=0x0) at g_acquire_cred.c:190
#9  0x00007f3356976846 in gss_acquire_cred (minor_status=minor_status@entry=0x7f3354305bfc, desired_name=desired_name@entry=0x0, time_req=time_req@entry=4294967295, desired_mechs=desired_mechs@entry=0x7f3354305c00, 
    cred_usage=cred_usage@entry=1, output_cred_handle=output_cred_handle@entry=0x7f3354305c90, actual_mechs=0x0, time_rec=0x0) at g_acquire_cred.c:107
#10 0x0000557532f210a2 in gssd_acquire_krb5_cred (gss_cred=gss_cred@entry=0x7f3354305c90) at krb5_util.c:1362
#11 0x0000557532f233de in gssd_acquire_user_cred (gss_cred=gss_cred@entry=0x7f3354305c90) at krb5_util.c:1382
#12 0x0000557532f201b8 in krb5_not_machine_creds (clp=clp@entry=0x55753376ea10, uid=uid@entry=99, tgtname=tgtname@entry=0x0, downcall_err=downcall_err@entry=0x7f3354305d04, chg_err=chg_err@entry=0x7f3354305d00, 
    rpc_clnt=rpc_clnt@entry=0x7f3354305d10) at gssd_proc.c:498
#13 0x0000557532f20704 in process_krb5_upcall (clp=clp@entry=0x55753376ea10, uid=uid@entry=99, fd=12, tgtname=tgtname@entry=0x0, service=service@entry=0x0) at gssd_proc.c:638
#14 0x0000557532f20e48 in handle_gssd_upcall (info=0x557533780b90) at gssd_proc.c:805
#15 0x00007f3355e0036d in start_thread () from /lib64/libpthread.so.0
#16 0x00007f3355b38bbf in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x7f3356fec5c0 (LWP 9579)):
#0  0x00007f3355b39123 in epoll_wait () from /lib64/libc.so.6
#1  0x00007f3356bd90fb in epoll_dispatch (base=0x55753376dcb0, tv=<optimized out>) at epoll.c:407
#2  0x00007f3356bc3dbe in event_base_loop (base=0x55753376dcb0, flags=0) at event.c:1625
#3  0x0000557532f1d6da in main (argc=<optimized out>, argv=0x7fff8d34b218) at gssd.c:1046

Comment 25 Simo Sorce 2017-08-31 17:45:37 UTC
I still see the mechglue plugin hung in epoll_wait(), which *shouldn't* happen as we have a timerfd in there too.  Anyway, I'm attaching a patch to add an additional timeout to epoll_wait() so that it will eventually return even if everything is screwed up.
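
The shape of the change, illustratively (this is not the attached diff, just a sketch of the idea, and the 30-second cap is a hypothetical value): rather than blocking in epoll_wait() forever and trusting the timerfd to fire, pass a hard timeout so the call returns no matter what:

/* Illustration only -- not the real gssproxy patch. */
#include <sys/epoll.h>
#include <errno.h>

#define HARD_TIMEOUT_MS (30 * 1000)   /* hypothetical hard cap */

static int wait_with_hard_timeout(int epollfd, struct epoll_event *ev)
{
    for (;;) {
        int n = epoll_wait(epollfd, ev, 1, HARD_TIMEOUT_MS);
        if (n > 0)
            return 0;           /* socket or timerfd became ready */
        if (n == 0)
            return ETIMEDOUT;   /* hard timeout: give up instead of hanging */
        if (errno != EINTR)
            return errno;       /* real error from epoll_wait() */
        /* EINTR: interrupted by a signal, retry */
    }
}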

Comment 26 Simo Sorce 2017-08-31 17:47:29 UTC
Created attachment 1320678 [details]
Add timeout harder patch

Comment 27 Simo Sorce 2017-08-31 18:57:17 UTC
Created attachment 1320698 [details]
Very WIP patch to allow some client side debugging

Comment 28 Jason Tibbitts 2017-09-06 17:50:39 UTC
Just a note that after the 'timeout harder' patch I have yet to see any problems.  I'm letting that churn a bit before trying that debugging patch.  Though for just one nasty debugging case I'd just change it to be unconditional.

But where do we go from here?  Certainly, if we're hitting a 30-second timeout then there's still a bug; I'm just not sure what it is.  I recall that Simo mentioned the possibility that gssproxy is somehow dropping an occasional request.  I have no idea how you would find that.

Comment 29 Fedora Update System 2017-10-27 15:30:20 UTC
gssproxy-0.7.0-22.fc27 has been submitted as an update to Fedora 27. https://bodhi.fedoraproject.org/updates/FEDORA-2017-833ed1a535

Comment 30 Fedora Update System 2017-10-27 18:50:31 UTC
gssproxy-0.7.0-22.fc27 has been pushed to the Fedora 27 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2017-833ed1a535

Comment 31 Fedora Update System 2017-11-11 03:07:52 UTC
gssproxy-0.7.0-22.fc27 has been pushed to the Fedora 27 stable repository. If problems still persist, please make note of it in this bug report.

