Bug 1519511

Summary: gssproxy hangs in a read during an NFS mount, but only when systemd starts rpc.gssd
Product: Red Hat Enterprise Linux 7
Version: 7.5
Component: gssproxy
Reporter: Steve Dickson <steved>
Assignee: Robbie Harwood <rharwood>
QA Contact: ipa-qe <ipa-qe>
Status: CLOSED ERRATA
Severity: urgent
Priority: urgent
Target Milestone: beta
Target Release: ---
Hardware: Unspecified
OS: Unspecified
CC: dzheng, fs-qe, hhan, junli, ksiddiqu, lmiksik, pasik, rharwood, steved, xdong, yisun, yoyang
Fixed In Version: gssproxy-0.7.0-16.el7
Doc Type: If docs needed, set a value
Last Closed: 2018-04-10 11:09:51 UTC
Type: Bug

Description Steve Dickson 2017-11-30 19:44:49 UTC
Description of problem:
gssproxy hangs in a read on its socket, but only when rpc.gssd
is started by systemd.

Here is the hung thread:
(gdb) info threads
  Id   Target Id         Frame 
  2    Thread 0x7fd0d12cd700 (LWP 1866) "rpc.gssd" 0x00007fd0d2dd176d in read ()
    at ../sysdeps/unix/syscall-template.S:81
* 1    Thread 0x7fd0d3fc1840 (LWP 780) "rpc.gssd" 0x00007fd0d2af4f93 in epoll_wait ()
    at ../sysdeps/unix/syscall-template.S:81

(gdb) thread 2
[Switching to thread 2 (Thread 0x7fd0d12cd700 (LWP 1866))]
#0  0x00007fd0d2dd176d in read () at ../sysdeps/unix/syscall-template.S:81
81	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)
(gdb) bt
#0  0x00007fd0d2dd176d in read () at ../sysdeps/unix/syscall-template.S:81
#1  0x00007fd0d1917bd4 in read (__nbytes=4, __buf=0x7fd0d11cc04c, __fd=<optimized out>)
    at /usr/include/bits/unistd.h:44
#2  gpm_recv_buffer (gpmctx=<optimized out>, length=<optimized out>, buffer=<optimized out>)
    at src/client/gpm_common.c:243
#3  gpm_make_call (proc=proc@entry=6, arg=arg@entry=0x7fd0d12cc210, 
    res=res@entry=0x7fd0d12cc170) at src/client/gpm_common.c:460
#4  0x00007fd0d1913fd9 in gpm_acquire_cred (minor_status=minor_status@entry=0x7fd0d12cc438, 
    in_cred_handle=in_cred_handle@entry=0x0, desired_name=<optimized out>, 
    time_req=time_req@entry=4294967295, desired_mechs=desired_mechs@entry=0x7fd0cc002440, 
    cred_usage=cred_usage@entry=1, impersonate=impersonate@entry=false, 
    output_cred_handle=0x7fd0cc0021e0, actual_mechs=actual_mechs@entry=0x0, 
    time_rec=time_rec@entry=0x0) at src/client/gpm_acquire_cred.c:104
#5  0x00007fd0d1918b50 in gssi_acquire_cred_from (minor_status=0x7fd0d12cc5d4, 

Unfortunately the hung thread is holding a mutex, which means all
other NFS mounts will hang as well.

It's not clear whether this is a gssproxy issue or a systemd
issue... I'll start with gssproxy since that's where the code
is hung.
 
Version-Release number of selected component (if applicable):
nfs-utils-1.3.0-0.52.el7.x86_64
krb5-libs-1.15.1-17.el7.x86_64
gssproxy-0.7.0-13.el7.x86_64


How reproducible:
100%

Steps to Reproduce:
1. Set up an IPA server to be used as the KDC
2. Register the NFS client and server as hosts with the IPA server and create NFS service principals for them (see the sketch after these steps)
3. On the NFS client: mount -o sec=krb5 nfsserver:/mnt/tmp /mnt/tmp
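
For reference, step 2 would look roughly like the following. This is only a sketch; the hostnames ipa.example.com and nfsclient.example.com are placeholders, not taken from this report, and the NFS server needs an equivalent service principal and keytab entry:

ipaserver# ipa host-add nfsclient.example.com
ipaserver# ipa service-add nfs/nfsclient.example.com
nfsclient# ipa-client-install --domain=example.com --server=ipa.example.com
nfsclient# ipa-getkeytab -s ipa.example.com -p nfs/nfsclient.example.com -k /etc/krb5.keytab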

Comment 2 Steve Dickson 2017-11-30 19:45:38 UTC
The entire back trace:
(gdb) thread 2
[Switching to thread 2 (Thread 0x7fd0d12cd700 (LWP 1866))]
#0  0x00007fd0d2dd176d in read () at ../sysdeps/unix/syscall-template.S:81
81	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)
(gdb) bt
#0  0x00007fd0d2dd176d in read () at ../sysdeps/unix/syscall-template.S:81
#1  0x00007fd0d1917bd4 in read (__nbytes=4, __buf=0x7fd0d11cc04c, __fd=<optimized out>)
    at /usr/include/bits/unistd.h:44
#2  gpm_recv_buffer (gpmctx=<optimized out>, length=<optimized out>, buffer=<optimized out>)
    at src/client/gpm_common.c:243
#3  gpm_make_call (proc=proc@entry=6, arg=arg@entry=0x7fd0d12cc210, 
    res=res@entry=0x7fd0d12cc170) at src/client/gpm_common.c:460
#4  0x00007fd0d1913fd9 in gpm_acquire_cred (minor_status=minor_status@entry=0x7fd0d12cc438, 
    in_cred_handle=in_cred_handle@entry=0x0, desired_name=<optimized out>, 
    time_req=time_req@entry=4294967295, desired_mechs=desired_mechs@entry=0x7fd0cc002440, 
    cred_usage=cred_usage@entry=1, impersonate=impersonate@entry=false, 
    output_cred_handle=0x7fd0cc0021e0, actual_mechs=actual_mechs@entry=0x0, 
    time_rec=time_rec@entry=0x0) at src/client/gpm_acquire_cred.c:104
#5  0x00007fd0d1918b50 in gssi_acquire_cred_from (minor_status=0x7fd0d12cc5d4, 
    desired_name=0x0, time_req=4294967295, desired_mechs=0x7fd0cc002440, cred_usage=1, 
    cred_store=0x0, output_cred_handle=0x7fd0d12cc500, actual_mechs=0x0, time_rec=0x0)
    at src/mechglue/gpp_acquire_cred.c:165
#6  0x00007fd0d3938aa0 in gss_add_cred_from (minor_status=minor_status@entry=0x7fd0d12cc5d4, 
    input_cred_handle=<optimized out>, desired_name=desired_name@entry=0x0, 
    desired_mech=<optimized out>, cred_usage=cred_usage@entry=1, 
    initiator_time_req=initiator_time_req@entry=4294967295, 
    acceptor_time_req=acceptor_time_req@entry=4294967295, cred_store=cred_store@entry=0x0, 
    output_cred_handle=output_cred_handle@entry=0x0, actual_mechs=actual_mechs@entry=0x0, 
    initiator_time_rec=0x0, acceptor_time_rec=0x0) at g_acquire_cred.c:455
#7  0x00007fd0d39390e9 in gss_acquire_cred_from (
    minor_status=minor_status@entry=0x7fd0d12cc6cc, desired_name=desired_name@entry=0x0, 
    time_req=time_req@entry=4294967295, desired_mechs=desired_mechs@entry=0x7fd0d12cc6d0, 
    cred_usage=cred_usage@entry=1, cred_store=cred_store@entry=0x0, 
    output_cred_handle=output_cred_handle@entry=0x7fd0d12cc7c0, 
    actual_mechs=actual_mechs@entry=0x0, time_rec=time_rec@entry=0x0) at g_acquire_cred.c:190
#8  0x00007fd0d3939304 in gss_acquire_cred (minor_status=minor_status@entry=0x7fd0d12cc6cc, 
    desired_name=desired_name@entry=0x0, time_req=time_req@entry=4294967295, 
    desired_mechs=desired_mechs@entry=0x7fd0d12cc6d0, cred_usage=cred_usage@entry=1, 
    output_cred_handle=output_cred_handle@entry=0x7fd0d12cc7c0, 
    actual_mechs=actual_mechs@entry=0x0, time_rec=time_rec@entry=0x0) at g_acquire_cred.c:107
#9  0x00005569456af1df in gssd_acquire_krb5_cred (gss_cred=gss_cred@entry=0x7fd0d12cc7c0)
    at krb5_util.c:1362
#10 0x00005569456b1389 in limit_krb5_enctypes (sec=sec@entry=0x7fd0d12cc7b0)
    at krb5_util.c:1426
#11 0x00005569456adc6b in create_auth_rpc_client (clp=clp@entry=0x556945db8450, 
    tgtname=tgtname@entry=0x0, clnt_return=clnt_return@entry=0x7fd0d12cccf0, 
    auth_return=auth_return@entry=0x7fd0d12ccc48, uid=uid@entry=0, cred=cred@entry=0x0, 
    authtype=0) at gssd_proc.c:337
#12 0x00005569456ae4e6 in krb5_use_machine_creds (clp=clp@entry=0x556945db8450, 
    uid=uid@entry=0, tgtname=tgtname@entry=0x0, service=service@entry=0x556945dc7330 "*", 
    rpc_clnt=rpc_clnt@entry=0x7fd0d12cccf0) at gssd_proc.c:558
#13 0x00005569456ae68d in process_krb5_upcall (clp=clp@entry=0x556945db8450, uid=uid@entry=0, 
    fd=9, tgtname=tgtname@entry=0x0, service=service@entry=0x556945dc7330 "*")
    at gssd_proc.c:645
#14 0x00005569456aeed9 in handle_gssd_upcall (info=0x556945dc7310) at gssd_proc.c:804
#15 0x00007fd0d2dcadd5 in start_thread (arg=0x7fd0d12cd700) at pthread_create.c:308
#16 0x00007fd0d2af49bd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113

Comment 3 Steve Dickson 2017-11-30 19:55:21 UTC
The workaround for this problem is setting
GSS_USE_PROXY="no" in /etc/sysconfig/nfs
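
Applying the workaround would look roughly like this, assuming the stock RHEL 7 unit names (nfs-config regenerates the environment from /etc/sysconfig/nfs, and rpc-gssd has to be restarted to pick it up; adjust the edit if GSS_USE_PROXY is not already present in the file):

client# sed -i 's/^GSS_USE_PROXY=.*/GSS_USE_PROXY="no"/' /etc/sysconfig/nfs
client# systemctl restart nfs-config rpc-gssd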

Comment 4 Robbie Harwood 2017-11-30 20:43:20 UTC
1. Does it happen in Fedora?

2. How do I reproduce this?

Comment 5 Steve Dickson 2017-11-30 21:51:48 UTC
(In reply to Robbie Harwood from comment #4)
> 1. Does it happen in Fedora?
IDK... I have not tried... I'll look into it.

> 
> 2. How do I reproduce this?

Steps to Reproduce:
1. Set up an IPA server to be used as the KDC
2. Register the NFS client and server as hosts with the IPA server and create NFS service principals for them
3. On the NFS client: mount -o sec=krb5 nfsserver:/mnt/tmp /mnt/tmp

Question: What is the read from the socket trying to get?
Maybe something is not running or has failed to come up?
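
One quick way to check whether gssproxy itself came up and its socket is there (this assumes the default socket path /var/lib/gssproxy/default.sock; verify against your gssproxy.conf):

client# systemctl status gssproxy
client# ss -xlp | grep gssproxy
client# ls -l /var/lib/gssproxy/default.sock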

Comment 6 Junxiang Li 2017-12-01 07:39:21 UTC
It is easily reproduced on a freshly installed machine.
The first occurrence was on October 30 (1016.n.1), but it did not happen very often.
This week the bug reproduces every time in our automated tests (fresh installs).
The fatal message:
Failed to lock byte 100

Comment 7 Steve Dickson 2017-12-01 17:01:44 UTC
(In reply to Robbie Harwood from comment #4)
> 1. Does it happen in Fedora?

Update F27: No (nfs-utils-2.2.1-1.rc1.fc27.x86_64 gssproxy-0.7.0-24.fc27.x86_64)


Updated F26: Yes (nfs-utils-2.2.1-0.fc28.x86_64, gssproxy-0.7.0-9.fc26.x86_64)
(gdb) info threads
  Id   Target Id         Frame 
  1    Thread 0x7fb552932600 (LWP 716) "rpc.gssd" 0x00007fb551241383 in epoll_wait () from /lib64/libc.so.6
  2    Thread 0x7fb54fe29700 (LWP 1092) "rpc.gssd" 0x00007fb55173028d in read
    () from /lib64/libpthread.so.0
* 3    Thread 0x7fb54f21f700 (LWP 1095) "rpc.gssd" 0x00007fb55173028d in read
    () from /lib64/libpthread.so.0
  4    Thread 0x7fb54ea1e700 (LWP 1096) "rpc.gssd" 0x00007fb55173028d in read
    () from /lib64/libpthread.so.0
  5    Thread 0x7fb54e21d700 (LWP 1184) "rpc.gssd" 0x00007fb55173028d in read
    () from /lib64/libpthread.so.0

Updated F25: No (nfs-utils-2.1.1-5.rc4.fc27.x86_64 gssproxy-0.7.0-9.fc25.x86_64)

Updated rawhide: Yes, but hung in epoll_wait instead of read
(gdb) bt
#0  0x00007fe5939198d7 in epoll_wait () from /lib64/libc.so.6
#1  0x00007fe592b54a63 in gpm_epoll_wait.constprop ()
   from /usr/lib64/gssproxy/proxymech.so
#2  0x00007fe592b55162 in gpm_make_call ()
   from /usr/lib64/gssproxy/proxymech.so
#3  0x00007fe592b512ef in gpm_acquire_cred ()
   from /usr/lib64/gssproxy/proxymech.so
#4  0x00007fe592b55fe5 in gssi_acquire_cred_from ()
   from /usr/lib64/gssproxy/proxymech.so
#5  0x00007fe594971ca0 in gss_add_cred_from () from /lib64/libgssapi_krb5.so.2
#6  0x00007fe59497222e in gss_acquire_cred_from ()
   from /lib64/libgssapi_krb5.so.2
#7  0x00007fe5949724b6 in gss_acquire_cred () from /lib64/libgssapi_krb5.so.2
#8  0x0000563d21f4d0b2 in gssd_acquire_krb5_cred ()
#9  0x0000563d21f4f534 in limit_krb5_enctypes ()
#10 0x0000563d21f4bd65 in create_auth_rpc_client.constprop ()
#11 0x0000563d21f4c458 in krb5_use_machine_creds ()
#12 0x0000563d21f4c622 in process_krb5_upcall ()
#13 0x0000563d21f4ce58 in handle_gssd_upcall ()
#14 0x00007fe593dfa55b in start_thread () from /lib64/libpthread.so.0
#15 0x00007fe5939195af in clone () from /lib64/libc.so.6

nfs-utils-2.2.1-1.rc1.fc28.1.x86_64 gssproxy-0.7.0-24.fc28.x86_64

IHTH...

Comment 8 Robbie Harwood 2017-12-01 20:20:16 UTC
There's no difference between the f27 and the rawhide gssproxy (hence the matching NVR), so that's kind of odd.

My test setup for f27 is working as of this morning's fixes; and rawhide is as well (gssproxy-0.7.0-25, nfs-utils-2.2.1-1.rc1.fc28.1.x86_64).  (This is with a non-IPA KDC.)

(I'm also running without KCM (yum erase sssd-kcm) because it's broken at the moment for this use case.)

If you could double-check with -25 that would be helpful; otherwise I'll check with a freeipa KDC, but it may be a while since I don't have that env handy.

Comment 9 Simo Sorce 2017-12-01 21:09:46 UTC
Steve did you restart rpc.gssd after you updated gssproxy before repeating your tests ?
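
For completeness, restarting both daemons (rather than rebooting) would look roughly like this; the unit names assume the stock RHEL 7 / Fedora packaging:

client# systemctl restart gssproxy rpc-gssd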

Comment 10 Steve Dickson 2017-12-02 16:54:32 UTC
(In reply to Robbie Harwood from comment #8)
> If you could double-check with -25 that would be helpful; otherwise I'll
> check with a freeipa KDC, but it may be a while since I don't have that env
> handy.

With both -24 and -25, mounts with -o sec=krb5 on F27 do succeed, but very slowly.
The mount.nfs process hangs in the same spot, gss_cred_init(), but
there must be some kind of timeout firing that causes the
process to wake up and continue.

On Rawhide it's the same thing... the mount takes even longer.
Here is what I'm looking at:

f27# time  mount  rhel7srv:/home/tmp /mnt/tmp

real	0m30.163s
user	0m0.004s
sys	0m0.021s

fedora# time mount -o sec=krb5 rhel7srv:/home/tmp /mnt/tmp

real	1m0.185s
user	0m0.010s
sys	0m0.029s

Comment 11 Steve Dickson 2017-12-02 16:55:16 UTC
(In reply to Simo Sorce from comment #9)
> Steve did you restart rpc.gssd after you updated gssproxy before repeating
> your tests ?

Yes... I updated and rebooted both VMs.

Comment 12 Robbie Harwood 2017-12-04 19:58:27 UTC
I can't reproduce on rawhide or f27:

[root@nfs-client vagrant]# time mount -v nfs-server.mivehind.net:/home /mnt/nfs/home -o vers=4.2
mount.nfs: timeout set for Mon Dec  4 19:59:13 2017
mount.nfs: trying text-based options 'vers=4.2,addr=192.168.56.10,clientaddr=192.168.56.15'

real	0m1.644s
user	0m0.006s
sys	0m0.024s
[root@nfs-client vagrant]# mount | grep /mnt/nfs/home
nfs-server.mivehind.net:/home on /mnt/nfs/home type nfs4 (rw,relatime,vers=4.2,rsize=262144,wsize=262144,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=krb5,clientaddr=192.168.56.15,local_lock=none,addr=192.168.56.10)
[root@nfs-client vagrant]# rpm -qa | grep gssproxy
gssproxy-debuginfo-0.7.0-25.fc27.x86_64
gssproxy-debugsource-0.7.0-25.fc27.x86_64
gssproxy-0.7.0-25.fc27.x86_64
[root@nfs-client vagrant]# rpm -qa | grep sssd-kcm
[root@nfs-client vagrant]# rpm -qa | grep nfs-utils
nfs-utils-2.2.1-1.rc1.fc27.x86_64
[root@nfs-client vagrant]#

Comment 13 Junxiang Li 2017-12-11 02:08:51 UTC
This bug has caused many automated test cases to fail.
Please give it more attention.
Thanks.

Comment 14 Steve Dickson 2017-12-13 14:08:36 UTC
The gssproxy fix in this bz 
     https://bugzilla.redhat.com/show_bug.cgi?id=1507817

should fix this problem....
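
Once that fix lands, a quick check on the client is to confirm the installed gssproxy is at least the Fixed In Version noted above and then retry the mount (restarting the daemons first, as discussed in comment 9):

client# rpm -q gssproxy        # expect gssproxy-0.7.0-16.el7 or newer
client# systemctl restart gssproxy rpc-gssd
client# mount -o sec=krb5 nfsserver:/mnt/tmp /mnt/tmp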

Comment 17 Kaleem 2018-01-02 09:16:32 UTC
(In reply to Steve Dickson from comment #14)
> The gssproxy fix in this bz 
>      https://bugzilla.redhat.com/show_bug.cgi?id=1507817
> 
> should fix this problem....

Steve ,

Can we mark this verified too, since https://bugzilla.redhat.com/show_bug.cgi?id=1507817 seems to be verified as per https://bugzilla.redhat.com/show_bug.cgi?id=1507817#c96 ?

Comment 18 Steve Dickson 2018-01-02 15:23:38 UTC
(In reply to Kaleem from comment #17)
> (In reply to Steve Dickson from comment #14)
> > The gssproxy fix in this bz 
> >      https://bugzilla.redhat.com/show_bug.cgi?id=1507817
> > 
> > should fix this problem....
> 
> Steve ,
> 
> Can we mark this verified too, since
> https://bugzilla.redhat.com/show_bug.cgi?id=1507817 seems to be verified as
> per https://bugzilla.redhat.com/show_bug.cgi?id=1507817#c96 ?

I would think so...

Comment 19 Kaleem 2018-01-04 13:39:45 UTC
Marking verified as per #c18

Comment 23 errata-xmlrpc 2018-04-10 11:09:51 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:0709