Red Hat Bugzilla – Full Text Bug Listing
|Summary:||nfs mount for anaconda kickstart file takes a long time|
|Product:||[Fedora] Fedora||Reporter:||Orion Poplawski <orion>|
|Component:||nfs-utils||Assignee:||Steve Dickson <steved>|
|Status:||CLOSED DUPLICATE||QA Contact:||Fedora Extras Quality Assurance <extras-qa>|
|Fixed In Version:||Doc Type:||Bug Fix|
|Doc Text:||Story Points:||---|
|Last Closed:||2009-10-23 08:58:43 EDT||Type:||---|
|oVirt Team:||---||RHEL 7.3 requirements from Atomic Host:|
Description Orion Poplawski 2009-09-30 11:31:40 EDT
Description of problem: With the change to nfsv4 mounts, it now takes about 35 seconds to get the kickstart file from the server. Previously it took less than a second. Also seeing: rpcbind: server localhost not responding, timed out in syslog. /etc/exports: /export *.cora.nwra.com(rw,sync,fsid=0,no_root_squash) /export/data1 *.cora.nwra.com(rw,sync,nohide,no_root_squash) nfs kickstart url: nfs:saga:/data1/ks/rawhide-desk.cfg
Comment 1 Steve Dickson 2009-10-01 11:00:09 EDT
would there be any way for you to post a bzip2-ed tshark network trace?
Comment 2 Orion Poplawski 2009-10-01 11:36:52 EDT
Created attachment 363360 [details] tshark trace dump - raw Only 16k. This is just port 2049. Let me know if you want other ports. 1 0.000000 192.168.0.191 -> 192.168.0.12 TCP wpgs > nfs [SYN] Seq=0 Win=5840 Len=0 MSS=1460 TSV=4294683552 TSER=0 WS=6 2 0.000063 192.168.0.12 -> 192.168.0.191 TCP nfs > wpgs [SYN, ACK] Seq=0 Ack=1 Win=5792 Len=0 MSS=1460 TSV=89252403 TSER=4294683552 WS=6 3 0.000600 192.168.0.191 -> 192.168.0.12 TCP wpgs > nfs [ACK] Seq=1 Ack=1 Win=5888 Len=0 TSV=4294683554 TSER=89252403 4 0.000903 192.168.0.191 -> 192.168.0.12 NFS V4 NULL Call 5 0.000939 192.168.0.12 -> 192.168.0.191 TCP nfs > wpgs [ACK] Seq=1 Ack=45 Win=5824 Len=0 TSV=89252404 TSER=4294683554 6 0.000985 192.168.0.12 -> 192.168.0.191 NFS V4 NULL Reply (Call In 4) 7 0.001148 192.168.0.191 -> 192.168.0.12 TCP wpgs > nfs [ACK] Seq=45 Ack=29 Win=5888 Len=0 TSV=4294683555 TSER=89252404 8 35.067646 192.168.0.191 -> 192.168.0.12 NFS V4 COMP Call <EMPTY> PUTROOTFH PUTROOTFH;GETFH GETFH;GETATTR GETATTR 9 35.081904 192.168.0.12 -> 192.168.0.191 NFS V4 COMP Reply (Call In 8) <EMPTY> PUTROOTFH PUTROOTFH;GETFH GETFH;GETATTR GETATTR syslog: RPC: Registered udp transport module RPC: Registered tcp transport module Slow work thread pool: Starting up Slow work thread pool: Ready FS-Cache: Loaded FS-Cache: Netfs 'nfs' registered for caching rpcbind: server localhost not responding, timed out NET: Registered protocol family 10 lo: Disabled Privacy Extensions mount.nfs used greatest stack depth: 4164 bytes left Maybe it just takes that long to get the nfsv4 stack up. Maybe it could get started up earlier? Maybe rpcbind is not configured to listen to localhost and needs to be. In fact, rpcbind is not running, maybe it needs to be? It's not a big deal, but I noticed it when the mount started failing because I needed to change the mount point to correspond to the nfsv4 root and it took a while to respond when before it was instantaneous.
Comment 3 Steve Dickson 2009-10-01 12:04:07 EDT
Hmm.... The trace definitely shows a 35sec delay between the mount and the initial traffic... Does this happen all the time?
Comment 4 Orion Poplawski 2009-10-01 12:32:35 EDT
Yes. Is the "NULL" command the mount? I can duplicate on a running system if I stop rpcbind. Messages show: Oct 1 10:30:45 localhost kernel: rpcbind: server localhost not responding, timed out Oct 1 10:30:45 localhost rpc.idmapd: nss_getpwnam: name 'email@example.com' does not map into domain 'localdomain' So it looks like nfsv4 mounts need to talk to rpc.idmapd, which isn't configured in the anaconda environment?
Comment 5 Orion Poplawski 2009-10-01 12:36:12 EDT
strace on mount shows the hang here: [pid 2542] mount("saga:/data1", "/mnt", "nfs", 0, "addr=192.168.0.12,vers=4,clienta"...) = 0 <35.020464>
Comment 6 Steve Dickson 2009-10-01 13:19:59 EDT
> Is the "NULL" command the mount? Yes, we check to see if V4 is support and is up and running before do the actual mount... If neither one of those is true we roll back to v3. > So it looks like nfsv4 mounts need to talk to rpc.idmapd, which isn't > configured in the anaconda environment? Yes, this what is causing the delay... One of the main differences in V4 is the uids and gids of a file are no longer passed between the server and client. Instead, strings are passed back and forth in the format of username@domain Then user string is converted into uid and gid on both the server and client. This is done by the kernel doing an "upcall" to rpc.idmapd. So I'm guessing there is a 35sec time out on that upcall. >Oct 1 10:30:45 localhost kernel: rpcbind: server localhost not responding, > timed out hmm... I wonder who is trying to talk to rpcbind and why?? > Oct 1 10:30:45 localhost rpc.idmapd: nss_getpwnam: name > 'firstname.lastname@example.org' does not map into domain 'localdomain' In /etc/idmapd.conf, try setting 'Domain = cora.nwra.com' to see if this goes away... Question: How many users are needed for an install? One, root, correct?
Comment 7 Orion Poplawski 2009-10-01 14:01:12 EDT
(In reply to comment #6) > > Oct 1 10:30:45 localhost rpc.idmapd: nss_getpwnam: name > > 'email@example.com' does not map into domain 'localdomain' > In /etc/idmapd.conf, try setting 'Domain = cora.nwra.com' > to see if this goes away... It makes both messages go away (rpcbind,rpc.idmapd), though not the 35 sec delay.
Comment 8 Steve Dickson 2009-10-01 15:44:12 EDT
Just to be clear, you are starting up rpc.idmapd with the domain set to cora.nwra.com and you are still seeing the delay??
Comment 9 Orion Poplawski 2009-10-01 15:51:54 EDT
(In reply to comment #8) > Just to be clear, you are starting up rpc.idmapd with the domain > set to cora.nwra.com and you are still seeing the delay?? Yes. I thought the rpcbind message went away, but it looks like it is still there: Oct 1 13:49:52 localhost kernel: rpcbind: server localhost not responding, timed out Start rpcbind again and the above and the delay goes away.
Comment 10 Steve Dickson 2009-10-02 12:31:51 EDT
It appears the v4 mount is getting caught in some legacy sunrpc code.... I've started the upstream discussion as to how we should handle this... http://linux-nfs.org/pipermail/nfsv4/2009-October/011378.html