Bug 526494

Summary: nfs mount for anaconda kickstart file takes a long time
Product: [Fedora] Fedora Reporter: Orion Poplawski <orion>
Component: nfs-utilsAssignee: Steve Dickson <steved>
Status: CLOSED DUPLICATE QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: medium Docs Contact:
Priority: low    
Version: rawhideCC: anaconda-maint-list, steved
Target Milestone: ---   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2009-10-23 12:58:43 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
tshark trace dump - raw none

Description Orion Poplawski 2009-09-30 15:31:40 UTC
Description of problem:

With the change to nfsv4 mounts, it now takes about 35 seconds to get the kickstart file from the server.  Previously it took less than a second.

Also seeing:

rpcbind: server localhost not responding, timed out

in syslog.

/etc/exports:

/export         *.cora.nwra.com(rw,sync,fsid=0,no_root_squash)
/export/data1   *.cora.nwra.com(rw,sync,nohide,no_root_squash)

nfs kickstart url:

nfs:saga:/data1/ks/rawhide-desk.cfg

Comment 1 Steve Dickson 2009-10-01 15:00:09 UTC
would there be any way for you to post a bzip2-ed tshark
network trace?

Comment 2 Orion Poplawski 2009-10-01 15:36:52 UTC
Created attachment 363360 [details]
tshark trace dump - raw

Only 16k.  This is just port 2049.  Let me know if you want other ports.

  1   0.000000 192.168.0.191 -> 192.168.0.12 TCP wpgs > nfs [SYN] Seq=0 Win=5840 Len=0 MSS=1460 TSV=4294683552 TSER=0 WS=6
  2   0.000063 192.168.0.12 -> 192.168.0.191 TCP nfs > wpgs [SYN, ACK] Seq=0 Ack=1 Win=5792 Len=0 MSS=1460 TSV=89252403 TSER=4294683552 WS=6
  3   0.000600 192.168.0.191 -> 192.168.0.12 TCP wpgs > nfs [ACK] Seq=1 Ack=1 Win=5888 Len=0 TSV=4294683554 TSER=89252403              
  4   0.000903 192.168.0.191 -> 192.168.0.12 NFS V4 NULL Call         
  5   0.000939 192.168.0.12 -> 192.168.0.191 TCP nfs > wpgs [ACK] Seq=1 Ack=45 Win=5824 Len=0 TSV=89252404 TSER=4294683554                 
  6   0.000985 192.168.0.12 -> 192.168.0.191 NFS V4 NULL Reply (Call In 4) 
  7   0.001148 192.168.0.191 -> 192.168.0.12 TCP wpgs > nfs [ACK] Seq=45 Ack=29 Win=5888 Len=0 TSV=4294683555 TSER=89252404                    
  8  35.067646 192.168.0.191 -> 192.168.0.12 NFS V4 COMP Call <EMPTY> PUTROOTFH PUTROOTFH;GETFH GETFH;GETATTR GETATTR                          
  9  35.081904 192.168.0.12 -> 192.168.0.191 NFS V4 COMP Reply (Call In 8) <EMPTY> PUTROOTFH PUTROOTFH;GETFH GETFH;GETATTR GETATTR      


syslog:

RPC: Registered udp transport module
RPC: Registered tcp transport module
Slow work thread pool: Starting up
Slow work thread pool: Ready
FS-Cache: Loaded
FS-Cache: Netfs 'nfs' registered for caching
rpcbind: server localhost not responding, timed out
NET: Registered protocol family 10
lo: Disabled Privacy Extensions
mount.nfs used greatest stack depth: 4164 bytes left

Maybe it just takes that long to get the nfsv4 stack up.  Maybe it could get started up earlier?  Maybe rpcbind is not configured to listen to localhost and needs to be.  In fact, rpcbind is not running, maybe it needs to be?

It's not a big deal, but I noticed it when the mount started failing because I needed to change the mount point to correspond to the nfsv4 root and it took a while to respond when before it was instantaneous.

Comment 3 Steve Dickson 2009-10-01 16:04:07 UTC
Hmm.... The trace definitely shows a 35sec delay between the mount
and the initial traffic...   Does this happen all the time?

Comment 4 Orion Poplawski 2009-10-01 16:32:35 UTC
Yes.  Is the "NULL" command the mount?

I can duplicate on a running system if I stop rpcbind.  Messages show:

Oct  1 10:30:45 localhost kernel: rpcbind: server localhost not responding, timed out
Oct  1 10:30:45 localhost rpc.idmapd[1235]: nss_getpwnam: name 'root.com' does not map into domain 'localdomain'

So it looks like nfsv4 mounts need to talk to rpc.idmapd, which isn't configured in the anaconda environment?

Comment 5 Orion Poplawski 2009-10-01 16:36:12 UTC
strace on mount shows the hang here:

[pid  2542] mount("saga:/data1", "/mnt", "nfs", 0, "addr=192.168.0.12,vers=4,clienta"...) = 0 <35.020464>

Comment 6 Steve Dickson 2009-10-01 17:19:59 UTC
> Is the "NULL" command the mount?
Yes, we check to see if V4 is support and is up and running before
do the actual mount... If neither one of those is true we roll
back to v3.

> So it looks like nfsv4 mounts need to talk to rpc.idmapd, which isn't
> configured in the anaconda environment?  
Yes, this what is causing the delay... One of the main differences in
V4 is the uids and gids of a file are no longer passed between the 
server and client. Instead, strings are passed back and forth in the 
format of
    username@domain

Then user string is converted into uid and gid on both the server and
client. This is done by the kernel doing an "upcall" to rpc.idmapd.
So I'm guessing there is a 35sec time out on that upcall.

>Oct  1 10:30:45 localhost kernel: rpcbind: server localhost not responding,
> timed out
hmm... I wonder who is trying to talk to rpcbind and why?? 

> Oct  1 10:30:45 localhost rpc.idmapd[1235]: nss_getpwnam: name
> 'root.com' does not map into domain 'localdomain'
In /etc/idmapd.conf, try setting 'Domain = cora.nwra.com'
to see if this goes away... 

Question:
How many users are needed for an install? One, root, correct?

Comment 7 Orion Poplawski 2009-10-01 18:01:12 UTC
(In reply to comment #6)
> > Oct  1 10:30:45 localhost rpc.idmapd[1235]: nss_getpwnam: name
> > 'root.com' does not map into domain 'localdomain'
> In /etc/idmapd.conf, try setting 'Domain = cora.nwra.com'
> to see if this goes away... 

It makes both messages go away (rpcbind,rpc.idmapd), though not the 35 sec delay.

Comment 8 Steve Dickson 2009-10-01 19:44:12 UTC
Just to be clear, you are starting up rpc.idmapd with the domain
set to cora.nwra.com and you are still seeing the delay??

Comment 9 Orion Poplawski 2009-10-01 19:51:54 UTC
(In reply to comment #8)
> Just to be clear, you are starting up rpc.idmapd with the domain
> set to cora.nwra.com and you are still seeing the delay??  

Yes.  I thought the rpcbind message went away, but it looks like it is still there:

Oct  1 13:49:52 localhost kernel: rpcbind: server localhost not responding, timed out

Start rpcbind again and the above and the delay goes away.

Comment 10 Steve Dickson 2009-10-02 16:31:51 UTC
It appears the v4 mount is getting caught in some legacy sunrpc
code.... I've started the upstream discussion as to how we should
handle this...

 http://linux-nfs.org/pipermail/nfsv4/2009-October/011378.html

Comment 11 Steve Dickson 2009-10-23 12:58:43 UTC

*** This bug has been marked as a duplicate of bug 528537 ***