Bug 1334510 - nfs v4 mounts with sec=sys are slow to succeed or fail when client and server are kerberized [NEEDINFO]
Summary: nfs v4 mounts with sec=sys are slow to succeed or fail when client and server...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: nfs-utils
Version: 7.2
Hardware: All
OS: Linux
high
medium
Target Milestone: rc
: ---
Assignee: Steve Dickson
QA Contact: Yongcheng Yang
Milan Navratil
URL:
Whiteboard:
Depends On:
Blocks: 1298243
TreeView+ depends on / blocked
 
Reported: 2016-05-09 19:45 UTC by Frank Sorenson
Modified: 2020-09-10 09:37 UTC (History)
20 users (show)

Fixed In Version: nfs-utils-1.3.0-0.57.el7
Doc Type: No Doc Update
Doc Text:
undefined
Clone Of:
Environment:
Last Closed: 2018-10-30 11:48:04 UTC
Target Upstream Version:
steved: needinfo? (mnavrati)


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Red Hat Knowledge Base (Solution) 2298191 0 None None None 2017-07-25 15:24:06 UTC
Red Hat Product Errata RHBA-2018:3311 0 None None None 2018-10-30 11:48:44 UTC

Description Frank Sorenson 2016-05-09 19:45:32 UTC
Description of problem:

In RHEL 7.2, nfs v4 mount attempts are very slow, and can take many seconds (some reports have been greater than a minute), either to succeed or fail.


The issue appears to occur on systems configured with kerberos, when connecting to nfs exports which do not require kerberos, when rpc-gssd.service is running.


Version-Release number of selected component (if applicable):

kernel-3.10.0-327.10.1.el7.x86_64
nfs-utils-1.3.0-0.21.el7_2.x86_64


How reproducible:
see below


Steps to Reproduce:

(nfs client system should be configured as kerberos client)

nfs server:
# mkdir -p /exports
# echo "/exports *(rw,no_root_squash)" >> /etc/exports
# systemctl restart nfs-config.service
# systemctl restart nfs-server.service

nfs client:

# time mount -vvvv -t nfs -o vers=4 vm17:/exports /mnt/tmp
mount.nfs: timeout set for Mon May  9 14:50:37 2016
mount.nfs: trying text-based options 'vers=4,addr=192.168.122.85,clientaddr=192.168.122.73'

real	0m6.193s
user	0m0.001s
sys	0m0.008s

# umount /mnt/tmp
# time mount -vvvv -t nfs -o vers=4 vm17:/exports/none /mnt/tmp
mount.nfs: timeout set for Mon May  9 14:53:09 2016
mount.nfs: trying text-based options 'vers=4,addr=192.168.122.85,clientaddr=192.168.122.73'
mount.nfs: mount(2): No such file or directory
mount.nfs: mounting vm17:/exports/none failed, reason given by server: No such file or directory

real	0m3.645s
user	0m0.001s
sys	0m0.008s


# systemctl stop rpc-gssd.service
# time mount -vvvv -t nfs -o vers=4 vm17:/exports/ /mnt/tmp
mount.nfs: timeout set for Mon May  9 14:54:09 2016
mount.nfs: trying text-based options 'vers=4,addr=192.168.122.85,clientaddr=192.168.122.73'

real	0m0.021s
user	0m0.000s
sys	0m0.008s

# umount /mnt/tmp
# time mount -vvvv -t nfs -o vers=4 vm17:/exports/none /mnt/tmp
mount.nfs: timeout set for Mon May  9 14:54:26 2016
mount.nfs: trying text-based options 'vers=4,addr=192.168.122.85,clientaddr=192.168.122.73'
mount.nfs: mount(2): No such file or directory
mount.nfs: mounting vm17:/exports/none failed, reason given by server: No such file or directory

real	0m0.029s
user	0m0.002s
sys	0m0.005s


# systemctl start rpc-gssd.service
# time mount -vvvv -t nfs -o vers=4 vm17:/exports /mnt/tmp
mount.nfs: timeout set for Mon May  9 14:55:03 2016
mount.nfs: trying text-based options 'vers=4,addr=192.168.122.85,clientaddr=192.168.122.73'

real	0m6.988s
user	0m0.003s
sys	0m0.006s

# umount /mnt/tmp
# time mount -vvvv -t nfs -o vers=4,sec=sys vm17:/exports /mnt/tmp
mount.nfs: timeout set for Mon May  9 14:41:03 2016
mount.nfs: trying text-based options 'vers=4,sec=sys,addr=192.168.122.85,clientaddr=192.168.122.73'

real	0m2.348s
user	0m0.005s
sys	0m0.003s



Actual results:

nfs mounts complete slowly
slow mount attempts


Expected results:

nfs mounts complete quickly


Additional info:

This is believed to be related to security negotiations during an nfs v4 mount request.  If rpc-gssd is running, various upcalls and kerberos negotiations take place, even if kerberos is not being used on a mount point.


A workaround is to stop rpc-gssd:
# systemctl stop rpc-gssd.service

and permanently disable rpc-gssd for future system restarts:
# mkdir -p /etc/systemd/system/rpc-gssd.service.d/
# echo -e "[Unit]\nConditionNull=false" > /etc/systemd/system/rpc-gssd.service.d/01-disable.conf

Comment 3 Steve Dickson 2016-05-12 11:28:11 UTC
Does a /etc/krb5.keytab file exists? If so what does klist -k output?

Comment 6 Frank Sorenson 2016-05-16 19:19:43 UTC
Aha.  The packet capture makes it obvious.

The delay in each instance is for 20-25 DNS requests each for the SRV record for  _kerberos-master._udp.REALM and _kerberos-master._tcp.REALM (alternating), as well as 10 DNS lookups for the TXT record of _kerberos.NFS_SERVER_HOSTNAME.  There are also some tcp-level connect/disconnect sequences to port 2049:

  1 0.000000000 192.168.122.73 -> 192.168.122.85 TCP 74 822 > nfs [SYN] Seq=0 Win=29200 Len=0 MSS=1460 SACK_PERM=1 TSval=4225046357 TSecr=0 WS=128
  2 0.000441387 192.168.122.85 -> 192.168.122.73 TCP 74 nfs > 822 [SYN, ACK] Seq=0 Ack=1 Win=14480 Len=0 MSS=1460 SACK_PERM=1 TSval=404399202 TSecr=4225046357 WS=128
  3 0.000482014 192.168.122.73 -> 192.168.122.85 TCP 66 822 > nfs [ACK] Seq=1 Ack=1 Win=29312 Len=0 TSval=4225046357 TSecr=404399202
  4 0.001261157 192.168.122.73 -> 192.168.122.85 NFS 110 V4 NULL Call
  5 0.001565527 192.168.122.85 -> 192.168.122.73 TCP 66 nfs > 822 [ACK] Seq=1 Ack=45 Win=14592 Len=0 TSval=404399203 TSecr=4225046358
  6 0.001584251 192.168.122.85 -> 192.168.122.73 NFS 94 V4 NULL Reply (Call In 4)
  7 0.001592575 192.168.122.73 -> 192.168.122.85 TCP 66 822 > nfs [ACK] Seq=45 Ack=29 Win=29312 Len=0 TSval=4225046358 TSecr=404399203

  8 0.145085260 192.168.122.73 -> 192.168.122.2 DNS 101 Standard query 0xf695  SRV _kerberos-master._udp.EXAMPLE.COM
  9 0.333948294 192.168.122.2 -> 192.168.122.73 DNS 169 Standard query response 0xf695 No such name
 10 0.334553187 192.168.122.73 -> 192.168.122.2 DNS 101 Standard query 0x1f43  SRV _kerberos-master._tcp.EXAMPLE.COM
 11 0.561016469 192.168.122.2 -> 192.168.122.73 DNS 169 Standard query response 0x1f43 No such name

 12 0.599109091 192.168.122.73 -> 192.168.122.85 TCP 74 52341 > nfs [SYN] Seq=0 Win=29200 Len=0 MSS=1460 SACK_PERM=1 TSval=4225046956 TSecr=0 WS=128
 13 0.599452297 192.168.122.85 -> 192.168.122.73 TCP 74 nfs > 52341 [SYN, ACK] Seq=0 Ack=1 Win=14480 Len=0 MSS=1460 SACK_PERM=1 TSval=404399801 TSecr=4225046956 WS=128
 14 0.599489233 192.168.122.73 -> 192.168.122.85 TCP 66 52341 > nfs [ACK] Seq=1 Ack=1 Win=29312 Len=0 TSval=4225046956 TSecr=404399801

 15 0.621868854 192.168.122.73 -> 192.168.122.2 DNS 101 Standard query 0x4ec4  SRV _kerberos-master._udp.EXAMPLE.COM
 16 0.741135765 192.168.122.2 -> 192.168.122.73 DNS 169 Standard query response 0x4ec4 No such name
 17 0.741665707 192.168.122.73 -> 192.168.122.2 DNS 101 Standard query 0xa821  SRV _kerberos-master._tcp.EXAMPLE.COM
 19 0.884089936 192.168.122.2 -> 192.168.122.73 DNS 169 Standard query response 0xa821 No such name
 20 0.884587779 192.168.122.73 -> 192.168.122.2 DNS 74 Standard query 0x921a  TXT _kerberos.vm17
 21 0.885576934 192.168.122.2 -> 192.168.122.73 DNS 149 Standard query response 0x921a No such name
 22 0.886847521 192.168.122.73 -> 192.168.122.2 DNS 101 Standard query 0xbb42  SRV _kerberos-master._udp.EXAMPLE.COM
 23 1.028782501 192.168.122.2 -> 192.168.122.73 DNS 169 Standard query response 0xbb42 No such name
 24 1.029242872 192.168.122.73 -> 192.168.122.2 DNS 101 Standard query 0x5cca  SRV _kerberos-master._tcp.EXAMPLE.COM
 25 1.173429791 192.168.122.2 -> 192.168.122.73 DNS 169 Standard query response 0x5cca No such name

 26 1.174226841 192.168.122.73 -> 192.168.122.85 TCP 66 52341 > nfs [FIN, ACK] Seq=1 Ack=1 Win=29312 Len=0 TSval=4225047531 TSecr=404399801
 27 1.174603443 192.168.122.85 -> 192.168.122.73 TCP 66 nfs > 52341 [FIN, ACK] Seq=1 Ack=2 Win=14592 Len=0 TSval=404400376 TSecr=4225047531
 28 1.174631542 192.168.122.73 -> 192.168.122.85 TCP 66 52341 > nfs [ACK] Seq=2 Ack=2 Win=29312 Len=0 TSval=4225047531 TSecr=404400376

 29 1.177653652 192.168.122.73 -> 192.168.122.85 TCP 74 39538 > nfs [SYN] Seq=0 Win=29200 Len=0 MSS=1460 SACK_PERM=1 TSval=4225047534 TSecr=0 WS=128
 30 1.177929836 192.168.122.85 -> 192.168.122.73 TCP 74 nfs > 39538 [SYN, ACK] Seq=0 Ack=1 Win=14480 Len=0 MSS=1460 SACK_PERM=1 TSval=404400380 TSecr=4225047534 WS=128
 31 1.177963145 192.168.122.73 -> 192.168.122.85 TCP 66 39538 > nfs [ACK] Seq=1 Ack=1 Win=29312 Len=0 TSval=4225047535 TSecr=404400380

 32 1.180957554 192.168.122.73 -> 192.168.122.2 DNS 101 Standard query 0xbee3  SRV _kerberos-master._udp.EXAMPLE.COM
...
183 6.474097150 192.168.122.73 -> 192.168.122.2 DNS 74 Standard query 0xa70b  TXT _kerberos.vm17
184 6.475046414 192.168.122.2 -> 192.168.122.73 DNS 149 Standard query response 0xa70b No such name
185 6.476200503 192.168.122.73 -> 192.168.122.2 DNS 101 Standard query 0x68d7  SRV _kerberos-master._udp.EXAMPLE.COM
186 6.619474757 192.168.122.2 -> 192.168.122.73 DNS 169 Standard query response 0x68d7 No such name
187 6.619943366 192.168.122.73 -> 192.168.122.2 DNS 101 Standard query 0xc42b  SRV _kerberos-master._tcp.EXAMPLE.COM
188 6.762759088 192.168.122.2 -> 192.168.122.73 DNS 169 Standard query response 0xc42b No such name

189 6.763515971 192.168.122.73 -> 192.168.122.85 TCP 66 33746 > nfs [FIN, ACK] Seq=1 Ack=1 Win=29312 Len=0 TSval=4225053120 TSecr=404405384
190 6.763835164 192.168.122.85 -> 192.168.122.73 TCP 66 nfs > 33746 [FIN, ACK] Seq=1 Ack=2 Win=14592 Len=0 TSval=404405966 TSecr=4225053120
191 6.763859110 192.168.122.73 -> 192.168.122.85 TCP 66 33746 > nfs [ACK] Seq=2 Ack=2 Win=29312 Len=0 TSval=4225053121 TSecr=404405966
192 6.764459589 192.168.122.73 -> 192.168.122.85 NFS 194 V4 Call PUTROOTFH | GETATTR
193 6.764868515 192.168.122.85 -> 192.168.122.73 NFS 326 V4 Reply (Call In 192) PUTROOTFH | GETATTR
194 6.764896753 192.168.122.73 -> 192.168.122.85 TCP 66 822 > nfs [ACK] Seq=501 Ack=401 Win=30336 Len=0 TSval=4225053122 TSecr=404405967
195 6.764979095 192.168.122.73 -> 192.168.122.85 NFS 198 V4 Call GETATTR FH: 0x62d40c52
...


Digging a bit further, this is controlled by several configuration items in /etc/krb5.conf:

in the individual realm configuration of the '[realms]' section:
	master_kdc = <kerberos-master-server-hostname>
		if this is set, there is no need to resort to checking for dns SRV records, and the mount attempt completes immediately

if master_kdc is not set, DNS may be checked, depending on the settings for 'dns_lookup_kdc' and 'dns_lookup_realm' (which may be in the '[libdefaults]' or '[realms]' section):
	dns_lookup_kdc = [ 0 | false | 1 | true ]
		if enabled, as many as 25 dns requests for SRV records for '_kerberos-master._udp.EXAMPLE.COM' and '_kerberos-master._tcp.EXAMPLE.COM' will occur.  This defaults to enabled, so must be explicitly disabled to avoid the dns lookups

	dns_lookup_realm = [ 0 | false | 1 | true ]
		if enabled, as many as 10 dns requests for TXT record for '_kerberos.MOUNT_HOSTNAME' will occur.  This defaults to enabled, so must be explicitly disabled to avoid the dns lookups
		(an exception appears to be if 'kdc = <kerberos-hostname>' is also not specified--in this case, the TXT lookups do not happen)


several additional points:
	* if both 'dns_lookup_kdc' and 'dns_lookup_realm' are disabled, or if 'master_kdc' is given, no dns lookups are made, and the mount completes right away.

	* I think the variability in time between test runs, as well as the delay difference between my test systems and various customers, may be due to differences in the time required to make the dns requests, rather than a particular timeout (at least, I did not see an applicable timeout)

	* it doesn't matter whether the dns server returns a correct response--the TXT and/or SRV lookups will still happen

	* the dns lookups occur even if 'sec=sys' is specified


manpage for krb5.conf:
	dns_lookup_kdc
		Indicate  whether DNS SRV records should be used to locate the KDCs and other servers for a realm, if they are not
		listed in the krb5.conf information for the realm.  (Note that the admin_server entry must  be  in  the  krb5.conf
		realm information in order to contact kadmind, because the DNS implementation for kadmin is incomplete.)

		Enabling  this option does open up a type of denial-of-service attack, if someone spoofs the DNS records and redi‐
		rects you to another server.  However, it's no worse than a denial of service,  because  that  fake  KDC  will  be
		unable  to decode anything you send it (besides the initial ticket request, which has no encrypted data), and any‐
		thing the fake KDC sends will not be trusted without verification using some secret that it won't know.

	dns_lookup_realm
		not documented

Comment 10 Steve Dickson 2016-05-30 18:42:18 UTC
Also please start rpc.gssd with the '-vvv' flag then post what 
is logged in /var/log/message.

Comment 11 Frank Sorenson 2016-05-30 21:23:18 UTC
verbose rpc.gssd (with libtirpc debugging and kernel rpc:auth debugging as well) is in the attachment in comment 8

I just did another run, and I'll attach a new set of packet capture and verbose messages.

Comment 15 Steve Dickson 2016-06-06 14:49:51 UTC
Well the problem is when the NFS v4 client tried to identify 
itself to the NFS serve with a SETCLIENTID proc the default 
is to used the krb5i Kerberos favor. For this the kernel does 
an upcall to rpc.gssd to get the machine cred to use.

rpc.gssd looks in /etc/krb5.keytab for the machine cred
and then queries the Kerberos server to verify the
cred.  

Now when Kerberos is not configured there will not be
a /etc/krb5.keytab file which means rpc.gssd will not
startup which means the kernel upcall will not happen
since rpc.gssd is not runnning. 

So when the /etc/krb5.keytab exists, it's assuming 
Kerberos is configured for NFS, which in this case
is a bad assumption. Because from the traces and 
logs this client is unknown to the Kerberos server. 

The time is being spent in all those failed upcalls
and queries.

Unfortunately there is no way to tell rpc.gssd not
to start up when /etc/krb5.keytab.. Maybe there should be??

Would it be possible to add a principal that looks like
   host/hostname.domainname@REALM

to the Kerberos server?

Comment 20 Steve Dickson 2016-06-21 17:24:46 UTC
The upstream patch
   http://marc.info/?t=146652141200001&r=1&w=2

Comment 23 Frank Sorenson 2016-06-28 15:20:04 UTC
Based on the discussion so far, I think that perhaps the best way to address this would be documenting the conditions which combine to make the mount attempts slow.  There seem to be several workarounds, and not fully resolving the nfs client and server in DNS is arguably a misconfiguration.

(and yes, I do realize I opened the bz...)

Comment 30 Steve Dickson 2017-03-28 17:21:21 UTC
The official systemd way of disabling services is to
create a drop-in configuration as described above.
In this case do the following:

# mkdir -p /etc/systemd/system/rpc-gssd.service.d/
# echo -e "[Unit]\nConditionNull=false" > /etc/systemd/system/rpc-gssd.service.d/01-disable.conf

Which does indeed work...

Comment 42 Scott Mayhew 2017-07-25 14:22:53 UTC
Closing this again.  There are two methods for disabling rpc.gssd.


Method 1:  Using a systemd drop-in config file.  

Create /etc/systemd/system/rpc-gssd.service.d/01-disable.conf containing:
   
[Unit]
ConditionNull=false


Method 2:  Masking the service.  

Note that for backward compatibility purposes there is a rpc-gssd.service is symlinked to nfs-secure service, so it is necessary to mask both:

# systemctl mask rpc-gssd.service
# systemctl mask nfs-secure.service

Comment 58 Frank Sorenson 2018-09-17 13:01:24 UTC
nfs-secure.service is a symlink to rpc-gssd.service, so masking just rpc-gssd.service effectively masks both.  nfs-secure.service will point to rpc-gssd.service, which points to /dev/null

If both had to be masked, there is something wrong with how the customer is installing the system (perhaps they masked nfs-secure.service first, changing the symlink, but not actually masking anything).

Comment 59 Yongcheng Yang 2018-09-18 08:47:02 UTC
(In reply to Frank Sorenson from comment #58)

Hi Frank, thanks for your explanation!

> nfs-secure.service is a symlink to rpc-gssd.service, so masking just
> rpc-gssd.service effectively masks both.  nfs-secure.service will point to
> rpc-gssd.service, which points to /dev/null

Agree! Now I'm thinking it might be systemd bug and just file Bug 1630224 to track it.

Moving this on to VERIFIED now.

Comment 62 errata-xmlrpc 2018-10-30 11:48:04 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:3311


Note You need to log in before you can comment on or make changes to this bug.