Bug 1296607

Summary: have a configurable connection timeout for the rpcgssd service
Product: Red Hat Enterprise Linux 6 Reporter: Olga Kornieskaia <kolga>
Component: nfs-utilsAssignee: Steve Dickson <steved>
Status: CLOSED ERRATA QA Contact: Yongcheng Yang <yoyang>
Severity: medium Docs Contact:
Priority: unspecified    
Version: 6.7CC: eguan, fs-qe, kolga, yoyang
Target Milestone: rc   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: nfs-utils-1.2.3-68.el6 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2016-05-11 00:04:48 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Bug Depends On: 1009528    
Bug Blocks:    
Attachments:
Description Flags
modified patch of the upstream commit none

Description Olga Kornieskaia 2016-01-07 16:12:27 UTC
Created attachment 1112491 [details]
modified patch of the upstream commit

Description of problem:
Asking to include the fix for bugzilla 1009528 into RHEL6.x version. It's the same problem that when the server takes too long to respond, it leads gssd to timeout and log something like: 
"Jan  6 05:21:01 scspr0055500001 rpc.gssd[3152]: WARNING: can't create tcp 
rpc_clnt to server ae1nb141541.dmag212.com for user with uid 70120: RPC: Remote system error - Connection timed out"

Then, the kernel fails the operation with "permission denied". This problem has been observed by the NetApp QA on RHEL6.6 nfs-utils-1.2.3-54 version.

Fix is to include upstream commit b7b9e4e69587652b40d543094b133fb8b4881dc1
commit b7b9e4e69587652b40d543094b133fb8b4881dc1
Author: Rinku Kothiya <rkothiya>
Date:   Fri Jul 25 10:50:55 2014 -0400

   gssd: configurable connection timeout for the rpcgssd service

As the patch doesn't cleanly apply, I'm attaching a slightly modified version of that patch that does apply.

Version-Release number of selected component (if applicable):

How reproducible:


Steps to Reproduce:
1. mount with krb5
2. kinit as some user
3. create a test directory and there as user execute a test application (hammer) that has a variety of open, lock, read, write, rename, delete  ops etc
4. mount with krb5i
5. kinit as some user
6. same as 3
7. repeat 1-6 3 more times.

the failure is for when context is getting established for the user.


Actual results:


Expected results:


Additional info:

Comment 3 Steve Dickson 2016-01-08 20:19:06 UTC
> What do they have
> to do to reproduce this problem?

From what I understand, the test case does something like
1. mount with krb5
2. kinit as some user
3. create a test directory and there as user execute a test application (hammer) that has a variety of open, lock, read, write, rename, delete  ops etc
4. mount with krb5i
5. kinit as some user
6. same as 3
7. repeat 1-6 3 more times.

the failure is for when context is getting established for the user.

Comment 5 Olga Kornieskaia 2016-01-08 20:31:03 UTC
I was also able to reproduce this locally using "tc".

1. sudo mount -t nfs4 -osec=krb5 192.168.1.120:/ /mnt
2. kinit
3. tc qdisc add dev eth0 root handle 1: prio
4. tc filter add dev eth0 parent 1:0 protocol ip pref 55 handle ::55 u32 match ip dst 192.168.1.120 flowid 2:1
5. tc qdisc change dev eth0 parent 1:1 handle 2: netem delay 10000ms 10000ms
6. echo testing > /mnt/test

this fails with "permission denied"

Then run gssd with "-T 60" and see that test succeeds.

Comment 6 Yongcheng Yang 2016-01-11 10:28:01 UTC
(In reply to Olga Kornieskaia from comment #5)
> I was also able to reproduce this locally using "tc".
> 
> 1. sudo mount -t nfs4 -osec=krb5 192.168.1.120:/ /mnt
> 2. kinit
> 3. tc qdisc add dev eth0 root handle 1: prio
> 4. tc filter add dev eth0 parent 1:0 protocol ip pref 55 handle ::55 u32
> match ip dst 192.168.1.120 flowid 2:1
> 5. tc qdisc change dev eth0 parent 1:1 handle 2: netem delay 10000ms 10000ms
> 6. echo testing > /mnt/test
> 
> this fails with "permission denied"
> 
> Then run gssd with "-T 60" and see that test succeeds.

Hi Olga,
I cannot reproduce it using "tc" in previous version nfs-utils-1.2.3-64.el6.
Please help to have a look at the following steps:
[18:17:11 root@ ~~]# mount -t nfs4 -o sec=krb5 $HOSTNAME:/exportDir-krb5-common-local /mnt/nfsmp-krb5-common-local
[18:17:11 root@ ~~]# kinit root <<< 'redhat'
Password for root: 
[18:17:11 root@ ~~]# tc qdisc add dev eth0 root handle 1: prio
[18:17:11 root@ ~~]# tc qdisc show dev eth0
qdisc prio 1: root refcnt 9 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
[18:17:11 root@ ~~]# tc filter add dev eth0 parent 1:0 protocol ip pref 55 handle ::55 u32 match ip dst 10.73.4.241 flowid 2:1
[18:17:11 root@ ~~]# tc filter show dev eth0
filter parent 1: protocol ip pref 55 u32 
filter parent 1: protocol ip pref 55 u32 fh 800: ht divisor 1 
filter parent 1: protocol ip pref 55 u32 fh 800::55 order 85 key ht 800 bkt 0 flowid 2:1 
  match 0a4904f1/ffffffff at 16
[18:17:12 root@ ~~]# tc qdisc add dev eth0 parent 1:1 handle 2: netem delay 10000ms 10000ms
^^^^^^^^^^^^^ "tc qdisc change" fail "RTNETLINK answers: No such file or directory"
^^^^^^^^^^^^^ So I replace the "change" to be "add" as someone said
[18:17:12 root@ ~~]# tc qdisc show dev eth0
qdisc prio 1: root refcnt 9 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc netem 2: parent 1:1 limit 1000 delay 10.0s  10.0s
[18:17:12 root@ ~~]# date >/mnt/nfsmp-krb5-common-local/testfile
[18:17:12 root@ ~~]# umount /mnt/nfsmp-krb5-common-local
[18:17:12 root@ ~~]# cat /var/log/messages | grep 'Connection timed out'
[18:17:12 root@ ~~]#

Comment 7 Olga Kornieskaia 2016-01-11 13:32:16 UTC
Have you verified that you see a delay happening? Is this a localhost mount? If so you are setting the delay for eth0. Given your time output of the shell: 18:17:12 is your start of the "date" command and then "unmount" again at "18:17:12" so if the delay was 10s the delay definitely didn't happen. that command should have taken some time to complete.

Comment 8 Yongcheng Yang 2016-01-12 01:59:11 UTC
(In reply to Olga Kornieskaia from comment #7)
> Have you verified that you see a delay happening? Is this a localhost mount?
> If so you are setting the delay for eth0. Given your time output of the
> shell: 18:17:12 is your start of the "date" command and then "unmount" again
> at "18:17:12" so if the delay was 10s the delay definitely didn't happen.
> that command should have taken some time to complete.

Yes, this is a localhost mount and I have see the delay happened.
I can see the delay between my typing and screen showing, but seems the command exec without delay. I'm wondering whether it's the matter that "tc qdisc add" or "tc qdisc change" (in Comment 6)

[root@hp-dl160g8-01 ~]# rpm -q nfs-utils
nfs-utils-1.2.3-64.el6.x86_64
[root@hp-dl160g8-01 ~]# mount -t nfs4 -o sec=krb5 hp-dl160g8-01.rhts.eng.pek2.redhat.com:/exportDir-krb5-common-local /mnt/nfsmp-krb5-common-local
[root@hp-dl160g8-01 ~]# kinit root <<< 'redhat'
Password for root: 
[root@hp-dl160g8-01 ~]# tc qdisc add dev eth0 root handle 1: prio
[root@hp-dl160g8-01 ~]# tc filter add dev eth0 parent 1:0 protocol ip pref 55 handle ::55 u32 match ip dst 10.73.4.241 flowid 2:1
[root@hp-dl160g8-01 ~]# tc qdisc add dev eth0 parent 1:1 handle 2: netem delay 10000ms 10000ms
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ delay happening after this
[root@hp-dl160g8-01 ~]#  tc qdisc show dev eth0
qdisc prio 1: root refcnt 9 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc netem 2: parent 1:1 limit 1000 delay 10.0s  10.0s
[root@hp-dl160g8-01 ~]# tc filter show dev eth0
filter parent 1: protocol ip pref 55 u32 
filter parent 1: protocol ip pref 55 u32 fh 800: ht divisor 1 
filter parent 1: protocol ip pref 55 u32 fh 800::55 order 85 key ht 800 bkt 0 flowid 2:1 
  match 0a4904f1/ffffffff at 16
[root@hp-dl160g8-01 ~]# date; date >/mnt/nfsmp-krb5-common-local/testfile; date
Tue Jan 12 09:50:29 CST 2016
Tue Jan 12 09:50:29 CST 2016
[root@hp-dl160g8-01 ~]# date; umount /mnt/nfsmp-krb5-common-local; date
Tue Jan 12 09:51:10 CST 2016
Tue Jan 12 09:51:10 CST 2016
[root@hp-dl160g8-01 ~]# tc qdisc del dev eth0 root
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ delay disappeared after this
[root@hp-dl160g8-01 ~]# tc qdisc del dev eth0 root
RTNETLINK answers: No such file or directory
[root@hp-dl160g8-01 ~]# cat /var/log/messages | grep 'Connection timed out'
[root@hp-dl160g8-01 ~]# echo $?
1
[root@hp-dl160g8-01 ~]#

Comment 9 Olga Kornieskaia 2016-01-12 13:25:00 UTC
If this is a localhost mount then mount goes over loopback interface, not eth0.
I don't see any delay in your output.

Comment 10 Yongcheng Yang 2016-02-04 09:53:20 UTC
(In reply to Olga Kornieskaia from comment #9)
> If this is a localhost mount then mount goes over loopback interface, not
> eth0.
> I don't see any delay in your output.
Thanks Olga, I have been able to reproduce this problem.
Thanks again for your detailed reproducer!

> I was also able to reproduce this locally using "tc".
> 
> 1. sudo mount -t nfs4 -osec=krb5 192.168.1.120:/ /mnt
> 2. kinit
> 3. tc qdisc add dev eth0 root handle 1: prio
> 4. tc filter add dev eth0 parent 1:0 protocol ip pref 55 handle ::55 u32
> match ip dst 192.168.1.120 flowid 2:1
               ^^^^^^^^^^^^^ It's the server's IP, previously I wrongly understand
> 5. tc qdisc change dev eth0 parent 1:1 handle 2: netem delay 10000ms 10000ms
> 6. echo testing > /mnt/test
> 
> this fails with "permission denied"
> 
> Then run gssd with "-T 60" and see that test succeeds.

Comment 12 Yongcheng Yang 2016-02-19 07:41:26 UTC
Move to VERIFIED as Comment 11

Comment 14 errata-xmlrpc 2016-05-11 00:04:48 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-0868.html