165229 – nfs install hangs

Summary: nfs install hangs

Keywords:
Status:	CLOSED RAWHIDE
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	kernel
Sub Component:
Version:	rawhide
Hardware:	All
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	---
Assignee:	Steve Dickson
QA Contact:	Brian Brock

Reported:	2005-08-05 17:11 UTC by Orion Poplawski
Modified:	2007-11-30 22:11 UTC
CC List:	2 users
Fixed In Version:	2.6.15-1.2054_FC5
Last Closed:	2007-04-24 22:12:15 UTC
Type:	---

Attachments
tcpdump -w dump (1.01 MB, application/octet-stream) 2005-08-05 17:47 UTC, Orion Poplawski
Gzipped system trace (16.99 KB, application/octet-stream) 2005-09-19 19:40 UTC, Orion Poplawski
tcpdump -s 1500 trace of data from client to server (241.27 KB, application/octet-stream) 2007-04-24 16:52 UTC, Orion Poplawski

Description Orion Poplawski 2005-08-05 17:11:35 UTC

Description of problem:
This may be a kernel or other issue, but I'm starting with anaconda as it is an
install problem.

We use a central NFS server for all of our Fedora installs.  We have not seen
problems with any other systems.  I'm trying to install FC4 onto two
dual-Opteron machines that use the e100 driver.  The install hangs with the
following messages:

nfs warning: mount version older than kernel
nfs: server alexandria not responding, still trying
[repeats]

anaconda log shows:
* host is alexandria, dir is /export/data1/fedora/cora/4/x86_64/os
* mounting nfs path alexandria:/export/data1/fedora/cora/4/x86_64/os
* mounted alexandria:/export/data1/fedora/cora/4/x86_64/os on /mnt/source
* can access /mnt/source/Fedora/base/stage2.img
* mntloop loop0 on /mnt/runtime as /mnt/source/Fedora/base/stage2.img fs is 26


How reproducible:
Every time

Comment 1 Orion Poplawski 2005-08-05 17:47:32 UTC

Created attachment 117495
tcpdump -w dump

Using an updated install image (from current updated FC4) exhibits the same
problem.

This is a tcpdump packet capture of network traffic between the install machine
and the NFS server.

Comment 2 Orion Poplawski 2005-08-09 22:48:11 UTC

Well, it's not just x86_64.  Just saw it on our dual Xeon server.  It also uses
an e100 NIC.  I also tried this with an e1000 NIC in the machine and it timed
out as well, though it got as far as running anaconda; that was the last
message before it timed out.

The other commonality is that they are all on the same switch as the server
(SMC8508T), while other machines are at least a hop away on our Cisco switch stack.

I have managed to install on other (single processor) 32-bit e100 machines.

Comment 3 Steve Dickson 2005-09-01 12:27:21 UTC

Hmm... it appears the server is going off the deep
end... Would it be possible to get a system trace
from 'alexandria' by doing an 'echo t > /proc/sysrq-trigger'?

Comment 4 Orion Poplawski 2005-09-19 19:40:12 UTC

Created attachment 118999
Gzipped system trace

The server is fine except for these particular FC4 upgrade attempts.  It is one
of our main NFS file servers and we generally don't have problems with it.

Here is the trace while the client (FC4 upgrade) is stalling.

Let me know what else I can do/send.

Comment 5 Dave Jones 2005-11-10 19:28:41 UTC

2.6.14-1.1637_FC4 has been released as an update for FC4.
Please retest with this update, as a large amount of code has been changed in
this release, which may have fixed your problem.

Thank you.

Comment 6 Dave Jones 2006-02-03 05:32:04 UTC

This is a mass-update to all currently open kernel bugs.

A new kernel update has been released (Version: 2.6.15-1.1830_FC4)
based upon a new upstream kernel release.

Please retest against this new kernel, as a large number of patches
go into each upstream release, possibly including changes that
may address this problem.

This bug has been placed in NEEDINFO_REPORTER state.
Due to the large volume of inactive bugs in bugzilla, if this bug is
still in this state in two weeks time, it will be closed.

Should this bug still be relevant after this period, the reporter
can reopen the bug at any time. Any other users on the Cc: list
of this bug can request that the bug be reopened by adding a
comment to the bug.

If this bug is a problem preventing you from installing the
release this version is filed against, please see bug 169613.

Thank you.

Comment 7 Orion Poplawski 2006-04-05 18:18:41 UTC

Just ran into this again installing FC4 x86_64 from our FC4
(2.6.15-1.1831_FC4smp) NFS install server.  Here's the rub though:  The systems
are on the same SMC8508T switch.  If I move the client (install target) to
another switch, the install works fine.  The client has a 100 Mbit NIC and the
server a 1 Gbit NIC.

And this is only during install.  Once installed, I move the wire back to the
SMC switch and everything is fine.  Although now that I think about it, NFS
traffic normally travels over a separate gigabit network.  I just did a basic
test mounting the partition over the 100 Mbit NIC and it can copy data off of it
just fine.

Comment 8 Steve Dickson 2006-07-25 04:08:38 UTC

This sounds more like a network problem than an NFS problem... When
the system hangs, can you see (via Ethereal or tcpdump)
any traffic at all?

Comment 9 Dave Jones 2006-09-17 02:03:09 UTC

[This comment added as part of a mass-update to all open FC4 kernel bugs]

FC4 has now transitioned to the Fedora legacy project, which will continue to
release security related updates for the kernel.  As this bug is not security
related, it is unlikely to be fixed in an update for FC4, and has been migrated
to FC5.

Please retest with Fedora Core 5.

Thank you.

Comment 10 Orion Poplawski 2006-09-18 15:25:39 UTC

This does appear to be fixed in FC5.

Comment 11 Orion Poplawski 2007-04-24 16:50:32 UTC

Seeing this again trying to install current rawhide on a laptop.  Our nfs server
is now running FC6.

Comment 12 Orion Poplawski 2007-04-24 16:52:43 UTC

Created attachment 153364
tcpdump -s 1500 trace of data from client to server


This is kind of odd, from the end of the trace:

10:41:09.507179 IP (tos 0xc0, ttl  64, id 29724, offset 0, flags [none], proto:
ICMP (1), length: 576) cynosure.cora.nwra.com > saga.cora.nwra.com: ICMP ip
reassembly time exceeded, length 556
	IP (tos 0x0, ttl  64, id 54448, offset 0, flags [+], proto: UDP (17),
length: 1500) saga.cora.nwra.com.nfs > cynosure.cora.nwra.com.2729529335: reply
ok 1472 read REG 100644 ids 537/537 sz 94142464 nlink 1 rdev 0/0 fsid 1605
fileid e94006 a/m/ctime 1177424659.000000 1177403744.000000 1177424927.000000
16384 bytes
	MPLS extension v4 packet not supported
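
For anyone wanting to check their own capture for the same symptom, here is a
minimal Python sketch (not part of the original report) that counts "ip
reassembly time exceeded" ICMP messages per sending host in tcpdump text
output. The function name and the regular expression are illustrative
assumptions based only on the output format quoted above; other tcpdump
versions may word the line differently.

```python
import re
from collections import Counter

# Matches "<sender> > <receiver>: ICMP ip reassembly time exceeded" as printed
# in the excerpt above. The exact wording is an assumption about tcpdump's
# text output and may differ between versions.
TIMEOUT_RE = re.compile(r"(\S+) > (\S+): ICMP ip reassembly time exceeded")


def count_reassembly_timeouts(tcpdump_text: str) -> Counter:
    """Count reassembly-timeout ICMP messages per sending host."""
    # Verbose tcpdump output wraps records across lines, so collapse all
    # whitespace runs into single spaces before matching.
    flat = " ".join(tcpdump_text.split())
    return Counter(match.group(1) for match in TIMEOUT_RE.finditer(flat))


if __name__ == "__main__":
    sample = """10:41:09.507179 IP (tos 0xc0, ttl  64, id 29724, offset 0, flags [none], proto:
ICMP (1), length: 576) cynosure.cora.nwra.com > saga.cora.nwra.com: ICMP ip
reassembly time exceeded, length 556"""
    print(count_reassembly_timeouts(sample))
```

In the excerpt above the ICMP message is sent by cynosure, the host the NFS
read reply was addressed to, and the first fragment of that reply is quoted
inside the ICMP message.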

Comment 13 Orion Poplawski 2007-04-24 17:43:14 UTC

If I tell anaconda to use tcp (--opts=tcp in kickstart file), everything works fine.

Looking further in the logs, it looks like the server is sending lots of IP fragments (buffer size of around 16k?),
but the client is not receiving them and is resending the request.

This could be a problem on our network, I suppose, as we don't normally use UDP for NFS and
moving connections to different switches has helped at times.  But in general, I don't see many
problems on our network.
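
As a rough illustration of why lost fragments hurt NFS over UDP so much (a
sketch, not from the original report): a 16384-byte read reply does not fit in
one 1500-byte IP packet, so IP splits it into fragments, and losing any single
fragment discards the whole reply. The 1500-byte packet size and 16384-byte
read size match the trace in comment 12; the 128 bytes of RPC/NFS reply header
and the 1% loss rate below are made-up numbers for illustration only.

```python
import math

IP_HEADER = 20   # bytes, IPv4 header without options
UDP_HEADER = 8   # bytes


def fragment_count(udp_payload: int, mtu: int = 1500) -> int:
    """Number of IPv4 fragments needed to carry one UDP datagram."""
    # Every fragment except the last must carry a multiple of 8 data bytes.
    per_fragment = (mtu - IP_HEADER) // 8 * 8
    return math.ceil((UDP_HEADER + udp_payload) / per_fragment)


def datagram_survival(fragments: int, fragment_loss: float) -> float:
    """Probability that all fragments arrive, assuming independent loss."""
    return (1.0 - fragment_loss) ** fragments


if __name__ == "__main__":
    # 16384 bytes of read data plus an ASSUMED 128 bytes of RPC/NFS header.
    payload = 16384 + 128
    n = fragment_count(payload)
    # ASSUMED 1% per-fragment loss, purely for illustration.
    print(n, round(datagram_survival(n, 0.01), 3))
```

With these assumed numbers the reply needs 12 fragments, and about 11% of
replies are lost at 1% fragment loss; each time, the client has to resend the
request and wait for the whole reply again. Over TCP only the missing segment
is retransmitted.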

Comment 14 Steve Dickson 2007-04-24 22:12:15 UTC

It could be the case that you're not seeing network problems
because not too many applications use UDP these days.
But going to NFS over TCP is definitely the correct solution
and I'm not sure why --opts=tcp is not the default...
