Bug 486264

Summary: nfs hangs periodically on only the Fedora client
Product: [Fedora] Fedora
Reporter: Rich Ercolani <rercola>
Component: nfs-utils
Assignee: Steve Dickson <steved>
Status: CLOSED WONTFIX
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: medium
Priority: low
Version: 11
CC: ant.starikov, bugs, jeronimo, jlayton, kgroutsis, leo, prgarcial, rwheeler, steved, trevor
Target Milestone: ---
Keywords: Reopened, Triaged
Target Release: ---
Hardware: All
OS: Linux
Doc Type: Bug Fix
Last Closed: 2010-06-28 11:19:15 UTC
Bug Blocks: 516998
Attachments:
- A round of output from rpcdebug during one such hang
- dmesg output from nfs-hung client
- /v/l/messages | grep -i nfs output from latest crash
- better output from a weird /v/l/messages entry
- dmesg output on nfs file server after /proc rq trigger
- logfile after sysrq "t"

Description Rich Ercolani 2009-02-19 05:59:59 UTC
Description of problem:
Periodically, and with no obvious cause, all NFS connections between our Fedora 10 x86 client and our Solaris 11 NFS server hang, and dmesg informs me that the server is "not responding".

The server is responding to everyone else's requests, however, so it appears to be Fedora at work.

Restarting the nfsd on the Solaris server appears to resolve the problem, possibly by restarting all the open connections?

Version-Release number of selected component (if applicable):
Latest Fedora 10 (1.1.4-8.fc10)

How reproducible:
Happens once every day or three.

Actual results:
nfs connections stop responding until the NFSd is restarted.

Expected results:
nfs connections continue to function and don't fail like clockwork when every other client on the network has no issues.

Additional info:
The Solaris server's nfs reports no errors, and non-NFS connections between the two machines succeed.

Comment 1 Trevor Cordes 2009-03-05 17:41:25 UTC
A few weeks ago I upgraded from F8 to F10.  Everything has been pretty good except my always rock-solid NFS connection is now dying every 5-15 days as per the above.

My client is F10.  My NFS server is F8.  They worked great for as long as I can remember when they were both F8.

service nfs restart on server does NOT bring my connection back.  Neither does restarting the rpcd's I'm running on the server.

I am using NFSv4.
192.168.100.2:/data     /data                   nfs4    rw,bg,hard,intr,noatime,nosuid,proto=tcp,timeo=15,rsize=8192,wsize=8192 0 0

I noticed something very interesting: when the connection dies I get the normal "still trying" log message, but only ONE of them!  In the F8 days, if the server was not responding, I'd get hundreds of periodic "still trying" messages until the server was back up.  On F10 it just says this ONCE and then it's like it gives up for good or is hung somehow.
Mar  5 07:56:03 pog kernel: nfs: server 192.168.100.2 not responding, still trying

There are no errors showing up on the server.  The link between the two computers is good as I have multiple ssh sessions open during this time which did not have any problems.

Since my whole system can't live without /data, it's nearly impossible for me to unmount this nfs without rebooting.

Comment 2 Trevor Cordes 2009-03-05 17:53:49 UTC
Ah, I found a workaround to get the connection back up without rebooting.

umount -f /data
mount /data

The -f forces the nfs down even though there are open files.  I tried in vain to find every ps that was using /data and had killed 50% of my running ps's, yet umount (without -f) would still complain about "device busy" and lsof /data just hangs.  Not sure how many nasty side effects -f will have, but this works for me for now.

The fact the above works shows that the server is probably fine and the network is fine.

Obviously this is not a real, long-term solution.
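
(For the record, next time I'll try hunting down the stragglers with fuser first -- just a sketch, and I gather fuser can hang on a dead NFS mount the same way lsof does:)

fuser -vm /data        # list every process with something open under /data
fuser -km /data        # or kill them all outright before a plain umount
umount /data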

Comment 3 Steve Dickson 2009-03-06 10:12:05 UTC
Is autofs in the picture at all?

When the connection breaks please run (as root)
   rpcdebug -m rpc -s all 
Then record what is in dmesg:
   dmesg > /tmp/dmesg

I'm looking for any type of failures the client 
is having...

Note: to turn off the debugging do: rpcdebug -m rpc -c all
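
In other words, roughly this sequence (the /tmp path is just an example):

rpcdebug -m rpc -s all       # turn on RPC debugging
   ... wait for the hang to happen ...
dmesg > /tmp/dmesg           # capture the kernel messages
rpcdebug -m rpc -c all       # turn the debugging back off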

Comment 4 Trevor Cordes 2009-03-09 00:04:38 UTC
I'll run that debug next time it fails.

How can I tell if I'm running autofs?  I don't seem to have any autofs rpm installed.
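
(I suppose something like this would confirm it either way -- assuming the stock Fedora packaging:)

rpm -q autofs               # is the package even installed?
ps ax | grep [a]utomount    # is the automount daemon running?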

Comment 5 Rich Ercolani 2009-03-09 00:19:31 UTC
In contrast:
I'm using autofs, but it's being used both on the Fedora client and some other machines that stay up the whole time.

Comment 6 Rich Ercolani 2009-03-13 12:07:03 UTC
Created attachment 335078 [details]
A round of output from rpcdebug during one such hang

Comment 7 Rich Ercolani 2009-03-13 15:24:17 UTC
PS: According to the NFS server, it's replying to thousands[!] of requests per second from this client, all with the same response code: NFS4ERR_STALE_CLIENTID.

Any idea how the clientid would have become stale, and why this problem only ever affects this machine?

Comment 9 Trevor Cordes 2009-03-31 11:40:59 UTC
Strange, but this bug has not hit me again since my last post on Mar 5.  I will keep watching it and run the debug stuff next time.

Comment 10 Trevor Cordes 2009-04-16 07:26:56 UTC
This just hit me overnight.  Was fine when I went to sleep.  Woke up and NFS was hung on the client.

I captured the debug output but I can't interpret it.  Will attach.  I don't see the stale thing.

I'm going to try to recover now without rebooting... a tricky task since I rely heavily on that file server.

Comment 11 Trevor Cordes 2009-04-16 08:07:45 UTC
Created attachment 339810 [details]
dmesg output from nfs-hung client

Comment 12 Trevor Cordes 2009-04-16 08:09:32 UTC
Argh, I had to reboot to recover from it this time.  umount -f wouldn't work (busy/open).

I've thought about using soft mounting until this is resolved but the man page freaks me out when it talks about data loss using soft!  My entire computer becomes useless with about half the apps hanging when my NFS link goes down.
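
(If I did switch to soft, I gather the fstab line would look something like the following -- illustrative only, and as the man page warns, soft mounts can silently drop writes on a timeout:)

192.168.100.2:/data  /data  nfs4  rw,bg,soft,timeo=600,retrans=2,proto=tcp  0 0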

Comment 13 Rich Ercolani 2009-04-16 08:16:54 UTC
Using the "intr" mount option might be an acceptable interim solution, if you're using nfs4 (which you appear to be).

Comment 14 Trevor Cordes 2009-04-16 13:38:55 UTC
I just checked and I already was using intr!  That's bizarre because some of the apps that had /data as a cwd just wouldn't die.  I tried closing everything but some stuff just hung around and I couldn't get it to umount.  That happens nearly every time my nfs4 goes wonky.  Like I said, everything depends on my 5TB file server.

I am using nfs4.

Comment 15 Steve Dickson 2009-04-16 13:47:46 UTC
Just to be clear... 

What kernel versions is this happening on (i.e. uname -r)?

Comment 16 Steve Dickson 2009-04-16 13:51:34 UTC
Also I am not aware that Sun has released Solaris 11... 
Do you mean OpenSolaris? If so which flavor?

Comment 17 Trevor Cordes 2009-04-16 14:30:09 UTC
When it died previously in late Feb I was using:
2.6.27.15-170.2.24.fc10.i686.PAE

When it died today, I was using:
2.6.27.19-170.2.35.fc10.i686.PAE

Right now I am using, since the reboot:
2.6.27.21-170.2.56.fc10.i686.PAE

We'll see if .21 dies...

My nfs4 server is 2.6.26.6-49.fc8

Comment 18 Trevor Cordes 2009-04-24 21:49:32 UTC
Argh, just happened again, though a bit different this time.  I got a bunch of not responding errors on the client this time (previously I didn't see any).

Apr 24 14:53:40 pog kernel: nfs: server 192.168.100.2 not responding, still trying
Apr 24 15:04:59 pog kernel: nfs: server 192.168.100.2 not responding, timed out
Apr 24 15:04:59 pog kernel: nfs: server 192.168.100.2 not responding, timed out
Apr 24 15:04:59 pog kernel: nfs: server 192.168.100.2 not responding, timed out
Apr 24 15:04:59 pog kernel: nfs: server 192.168.100.2 not responding, timed out
Apr 24 15:04:59 pog kernel: nfs: server 192.168.100.2 not responding, timed out
Apr 24 15:04:59 pog kernel: nfs: server 192.168.100.2 not responding, timed out
Apr 24 15:04:59 pog kernel: nfs: server 192.168.100.2 not responding, timed out
Apr 24 15:05:59 pog kernel: nfs: server 192.168.100.2 not responding, still trying
Apr 24 16:13:42 pog kernel: nfs: server 192.168.100.2 not responding, timed out
Apr 24 16:13:42 pog kernel: nfs: server 192.168.100.2 not responding, timed out
Apr 24 16:13:42 pog kernel: nfs: server 192.168.100.2 not responding, timed out
Apr 24 16:13:42 pog kernel: nfs: server 192.168.100.2 not responding, timed out
Apr 24 16:13:42 pog kernel: nfs: server 192.168.100.2 not responding, timed out
Apr 24 16:13:42 pog kernel: nfs: server 192.168.100.2 not responding, timed out
Apr 24 16:15:12 pog kernel: nfs: server 192.168.100.2 not responding, still trying
Apr 24 16:18:08 pog kernel: nfs: server 192.168.100.2 not responding, still trying
Apr 24 16:19:45 pog kernel: nfs: server 192.168.100.2 not responding, still trying
Apr 24 16:22:51 pog kernel: nfs: server 192.168.100.2 not responding, still trying
Apr 24 16:26:26 pog kernel: nfs: server 192.168.100.2 not responding, still trying

I didn't notice it was down until 16:10 or so, then I started doing the usual killing of programs to try to umount the nfs.  The server looked OK, but I did a service nfs restart there a few times anyhow.  No recovery yet.

As before, I tried killing lots of progs and umount -f, but it's always the same:
#umount -f /data
umount2: Device or resource busy
umount.nfs4: /data: device is busy
umount2: Device or resource busy
umount.nfs4: /data: device is busy
Exit 16

lsof seems useless as it always hangs, and the -b option doesn't seem to work in linux: I can't find any dev ids in linux to feed that option (the lsof man page has lots of info).

I'm attaching the new dmesg, though this is from *after* I had already tried many of the remediation steps.

Back in the F8 days I may have gotten the odd (like once a year) "not responding" error that was always either solved on its own or by a service nfs restart on the server.  Only in F10 do I get these impossible client hangs requiring reboot.

If there are any other ideas for workarounds to get the mount back up without a reboot, it would be most appreciated.

Equally baffling is why umount -f isn't working, and why my hard,intr options don't seem to be working as they should!
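
(Next time, instead of rebooting, I plan to try a lazy unmount -- a sketch only, since I don't know yet what side effects it has on processes still holding files open:)

umount -l /data        # lazy: detach the mount now, clean up references later
umount -f /data        # follow with a force in case anything is left
mount /data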

Comment 19 Steve Dickson 2009-05-06 18:01:18 UTC
There was a recent fix in the umount code.... 

What version of the nfs-utils are you using? 

try nfs-utils-1.1.6

Comment 20 Trevor Cordes 2009-05-06 19:13:11 UTC
nfs-utils-1.1.4-8.fc10.i386

Any idea when 1.1.6 will be pushed to F10 updates?  I prefer to stay out of testing repos, but I suppose for this I could try it.

I assume this will only help with the umount stuck issue and not the underlying nfs failure issue?

PS: since the last problem, I have changed my client mount point from in / to in /mnt, leaving a symlink from the old to the new location.  Perhaps that will help make the system less dead when it goes down... or not.  Certainly can't hurt.

Comment 21 Jerome Walters 2009-05-17 12:39:16 UTC
We experience exactly the same problem. Our client is Debian Testing (_Squeeze_) x86 – a diskless node which uses nfsroot and boots from the server, also Debian Testing (_Squeeze_) x86. While the client hangs, the server is responding to everyone else's requests. Restarting the nfsd on the server doesn't solve the problem.

At first I wasn't able to capture debug information on the client side since /var/log was mounted over NFS, so I have installed a hard drive where I mounted only /var/log to be able to capture debug logs from the client as well.


Debug Logs: 
http://fixity.net/tmp/client.log.gz - Kernel RPC Debug Log from the client
http://fixity.net/tmp/server.log.gz - Kernel RPC Debug Log from the server


How reproducible:
Happens from 10 to 90 minutes after booting the diskless node.


Actual results:
NFS connections stop responding, the system hangs or becomes very slow and unresponsive (it doesn't respond to Ctrl+Alt+Del either). 60 to 90 minutes after the first server timeout the client says the server is OK, but it is still unresponsive. Immediately after that the client logs the server connection loss again, which leads to a continuous loop. The client remains unresponsive. Sometimes the client resumes normal operation for a couple of hours, but then the problem repeats.


Connectivity info: 
Both the client and the server are connected to a Gigabit Ethernet Cisco Metro series manageable switch. Both of them use Intel Pro 82545GM Gigabit Ethernet Server Controllers. Neither one of them logs any Ethernet errors, and none are logged by the switch.


Client & Server Load:
For the purposes of testing both machines were only running needed daemons and weren't loaded at all.


Client & Server Kernel:
On both the client and server a custom-compiled Linux 2.6.29.3 kernel was used. Configuration file @ http://fixity.net/tmp/config-2.6.29.3.gz


Client & Server Network interface fragmented packet queue length:
net.ipv4.ipfrag_high_thresh = 524288
net.ipv4.ipfrag_low_thresh = 393216


Client Versions:
libnfsidmap2/squeeze uptodate 0.21-2
nfs-common/squeeze uptodate 1:1.1.4-1


Client Mount (cat /proc/mounts | grep nfsroot):
10.11.11.1:/nfsroot / nfs rw,vers=3,rsize=524288,wsize=524288,namlen=255,hard,nointr,nolock,proto=tcp,timeo=7,retrans=10,sec=sys,addr=10.11.11.1 0 0


Client fstab:
proc            /proc           proc    defaults        0       0
/dev/nfs        /               nfs     defaults        1       1
none            /tmp            tmpfs   defaults        0       0
none            /var/run        tmpfs   defaults        0       0
none            /var/lock       tmpfs   defaults        0       0
none            /var/tmp        tmpfs   defaults        0       0


Client Daemons:
portmap, rpc.statd, rpc.idmapd


Server Daemons:
portmap, rpc.statd, rpc.idmapd, rpc.mountd --manage-gids


Server Versions:
libnfsidmap2/squeeze uptodate 0.21-2
nfs-common/squeeze uptodate 1:1.1.4-1
nfs-kernel-server/testing uptodate 1:1.1.4-1


Server Export:
/nfsroot 10.11.11.*(rw,no_root_squash,async,no_subtree_check)


Server Options:
RPCNFSDCOUNT=16
RPCNFSDPRIORITY=0
RPCMOUNTDOPTS=--manage-gids
NEED_SVCGSSD=no
RPCSVCGSSDOPTS=no


Additional Info:
Since I have read that tweaking the nfsroot mount options could improve the 
situation, I have tested with different options as follows:
rsize/wsize=1024|2048|4096|8192|32768|524288
timeo=7|15|60|600
retrans=3|10|20
None resulted in solving the problem.


I have also tested with the following version on the client and server end without any difference in the behaviour:
libnfsidmap2/testing uptodate 0.21-2
nfs-common 1:1.1.6-1 newer than version in archive
nfs-kernel-server 1:1.1.6-1 newer than version in archive




I have been messing with that problem for the last couple of weeks and ran out of ideas.


Best Regards,
Jerome Walters

Comment 22 Trevor Cordes 2009-05-20 09:44:47 UTC
Good (bad?) to see it's not just Fedora.  Looks like a full blown vanilla kernel bug then.

Your symptoms don't exactly match mine but I'm sure on a diskless node the NFS bug would be more severe and probably hit a lot sooner/more often.

I should note that I have a similarity with you: 82545GM NIC in both boxes.  Rich, you too?

Also, I'm running 9k jumbo packets on server+switch+client.  You guys?
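
(By 9k jumbo packets I mean roughly this on both NICs, plus the matching MTU setting on the switch -- eth0 is just an example interface name:)

ip link set dev eth0 mtu 9000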

I also have tweaked ip settings a bit:
net.core.rmem_max=8388608
net.core.wmem_max=8388608
net.core.rmem_default=262143
net.core.wmem_default=262143

Comment 23 Rich Ercolani 2009-05-20 21:52:46 UTC
I'm using an 82573L on the client.

Comment 24 Trevor Cordes 2009-05-28 19:31:34 UTC
Just had the bug hit again, grrr.  My moving the mount point made no difference in how much it locks up my system.  Because so many apps access that mount my whole desktop gets very wonky when NFS goes down.

My logs had some weird new stuff in them this time.  Not sure why.  Anyhow, perhaps it may be useful.  Attaching (| grep -i nfs)

This most recent crash happened with kernel 2.6.27.21-170.2.56.fc10.i686.PAE and was fine since May 13.  Now I am running 2.6.27.24-170.2.68.fc10.i686.PAE and hoping that will help.

Comment 25 Trevor Cordes 2009-05-28 19:32:14 UTC
Created attachment 345827 [details]
/v/l/messages | grep -i nfs output from latest crash

Comment 26 Trevor Cordes 2009-05-28 19:34:22 UTC
Oops, maybe grep nfs wasn't the way to go.  Here's the entire bit from one specific instance in the log.  Attaching.  Hmm, very strange.

Comment 27 Trevor Cordes 2009-05-28 19:35:24 UTC
Created attachment 345828 [details]
better output from a weird /v/l/messages entry

Comment 28 Trevor Cordes 2009-05-28 19:41:05 UTC
As I look into this more... very strange.  My box has 4GB ECC RAM and 4GB swap.  Most of the time almost no swap is used.  I have no idea if something was running away before I rebooted... I didn't think to check RAM/swap usage.  Besides the usual NFS down app hanging issues, the system was behaving fine.

The last entry like this was May 26 20:45:48.  Then
May 28 00:20:43 pog kernel: nfs: server 192.168.100.2 not responding, still trying
Then when I started using the system for the first time since the night of May 26 (maybe around 20:00)
May 28 11:58:55 pog kernel: nfs: server 192.168.100.2 not responding, timed out

I'm not sure if these issues are related as all prior nfs issues did not have a page allocation failure that I know of.

I will add this to my list of things to watch before rebooting next time.

Comment 29 Trevor Cordes 2009-07-27 06:48:33 UTC
I recently upgraded my file server to F10 (client has been F10 for months).  This bug just hit me again after a few days.  Strange, but the bug hadn't hit for months (since my last post) when it was F10 client / F8 server.  But that was probably a fluke.

Both systems were fully updated and rebooted as of a few days ago.  This bug is still a major problem.

Comment 30 Trevor Cordes 2009-09-13 11:15:40 UTC
Bug just hit me again.  First time since last report.  I was doing some slightly unusual heavy recursive deletes (hundreds of thousands of files) as well as heavy reads/writes (backups) over NFS when it conked out.

This time, when doing a reboot to recover, linux hung on "shutting down... CIFS mounts" and I had to press reset.  The backups were accessing CIFS and NFS mounts simultaneously so the NFS hang must have hung the CIFS mounts in a bad way.  Strange, but the kernel breezed through the unmounting NFS part of shutdown immediately before hanging on the CIFS mounts.

Comment 31 Mark Hittinger 2009-10-19 14:39:54 UTC
I have been chasing this for a couple of months on Fedora 9.

Clients will randomly hang for a specific Fedora 9 NFS server.  Restarting NFS on
that server causes the client to come unstuck.

Only one client at a time will hang - I have not seen two hang at the same time.

These clients mount several Fedora 9 NFS servers.  Only one specific NFS server
is the culprit each time.

Here are the unusual things about that NFS server:

1.  It is on private IP space (172.30.0.12) and the clients are on a different
    subnet (10.110.131.0/24).  I am using a static route to access the NFS
    server.

2.  The NFS server serves 3 mount points (one is RO, two are a mix of RO/RW:
    RO to some clients, RW to others).

I have NFS servers on 172.30.0.0/24 that export only one file system to the
10.110.131.0/24 clients and they never hang on those.

In summary I think the problem is either triggered by multiple exports with
the ro/rw mix or something screwy with using the static route to access a
different subnet on the same lan.

Comment 32 Trevor Cordes 2009-10-25 04:17:52 UTC
Hmm, #31.  May be a different bug.  So far I, and I think the other reporters, have had no luck reviving hung clients by restarting daemons on the *server*.  In my own experience, doing server stuff doesn't seem to affect the hung client at all.

I serve just 1 export (well, 2 because NFSv4 requires a bogus root one, see below) and my bug hits.  So it would not seem to be dependent on # of exports.

cat /etc/exports
/nfs 192.168.100.1(ro,async,no_subtree_check,no_root_squash,insecure,fsid=0)
/nfs/data 192.168.100.1(rw,async,no_subtree_check,no_root_squash,insecure,nohide)

It would be nice if you could upgrade the servers to F10 or 11 and see what happens.  Maybe best to open a separate bug since the symptoms are different enough?  But they may say you'll have to upgrade to 10 to get support.

Have you tried killing all mount-using ps's and umounting the NFS mount on the client when the bug hits?  Use lsof to help kill them all.  Lately I've been having success without a client reboot by killing a zillion ps's (everything I do uses the mount it seems like) and remounting.

Comment 33 Mark Hittinger 2009-10-25 13:44:49 UTC
As you say it may be a different bug since restarting the server cures it.

I haven't tried F11 yet - I did try a 2.6.31 kernel but got mount error 521's -
apparently there are some NFS changes I need to research before going further
with the latest kernel.

When the client hangs it's stuck - I can log in on the console as root but
never get to a shell, so we can't try umount -f and remount.

The next time it happens I am going to try to restart lockd on the server
instead of restarting nfsd and see if that brings the cure.  It sure is acting
like a lock issue.

Comment 34 Trevor Cordes 2009-10-26 10:35:48 UTC
What NFS version are you running?  3?  Perhaps 3 shows the bug in a different way, and server killing fixes it.

Are your important bin dirs (/usr for example) or /home or something else critical to bring up a shell mounted via this buggy NFS?  Weird that you can't login while it's frozen.  And when you kick the server all of a sudden the client returns to normal?

Comment 35 Steve Dickson 2009-10-26 12:48:32 UTC
Would it be possible to post a bzip2 tshark network trace? Something 
similar to  
     tshark -w /tmp/data.pcap host <server> ; bzip2 /tmp/data.pcap

Comment 36 Mark Hittinger 2009-10-26 16:57:34 UTC
Yes our setup is a couple hundred F9 clients mounting a dozen or so F9 NFS
file servers.  We are using NFS 3.

I think we have /usr/local/sbin in the root path but it is not being served
by the "cursed" file server.

Yes, when we do /etc/init.d/nfs restart on the "cursed" file server, whichever of the
clients happened to have locked up will suddenly come alive.  It hasn't
happened yet but should happen within the next couple of days.

Now will the tshark capture started after the "hang" is noticed be of any
value?  I can't predict which one of the clients will "hang".  I won't be able
to run it on the client but I would be able to run it on the file server after
I notice a client has hung.  Let me know if this is what you had in mind.

Comment 37 Steve Dickson 2009-10-26 17:46:30 UTC
> Now will the tshark capture started after the "hang" is noticed be of any
> value?
Probably not... Since you can not predict which client will hang, we
would definitely be in a 'needle in a haystack' scenario with all that
traffic... 

Is the server running out of memory? Theory being when the server is restarted
a bunch of memory is freed up, which allows things to proceed...

How about this: when the server hangs, do 'echo t > /proc/sysrq-trigger',
then dmesg > /tmp/dmesg.log, and then post a bzip2'd version (dmesg.log.bz2).
This will hopefully show where the nfsd processes are hanging...
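
Roughly this sequence on the server (you may need to enable sysrq first if it is disabled):

echo 1 > /proc/sys/kernel/sysrq     # enable sysrq if necessary
echo t > /proc/sysrq-trigger        # dump all task states to the kernel log
dmesg > /tmp/dmesg.log
bzip2 /tmp/dmesg.log                # then attach /tmp/dmesg.log.bz2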

Comment 38 Mark Hittinger 2009-10-26 21:53:45 UTC
The "cursed" file server is a Pentium-D with 4GB ram - no worries on running
out of ram.

Will try the dmesg thing when it happens again.

I have a 2.6.31.4 kernel with nfs-utils-1.2.0 ready to run on the "cursed"
server but it is not running yet.  I will wait for one more of these to happen.
The nfs-utils-1.2.0 resolves the mount error 521 issue I mentioned earlier.

Comment 39 Mark Hittinger 2009-11-02 16:11:29 UTC
Got one over the weekend.  One hung client.  Restarting NFSD on the "cursed"
server clears all ills.  Attached is the dmesg.log.bz2.  Earlier I had increased
the RPCNFSDCOUNT to 16.  If this is a problem I can drop it back down.  I will
probably get another one of these hangs sometime within the next week.

Comment 40 Mark Hittinger 2009-11-02 16:13:21 UTC
Created attachment 367158 [details]
dmesg output on nfs file server after /proc rq trigger

Comment 41 Steve Dickson 2009-11-04 16:19:31 UTC
Looking at the dmesg.log from Comment #40, it appears you are
running low on memory...

All 17 of your nfsd processes are hung in get_page_from_freelist(),
plus a number of other processes are hung in handle_mm_fault(), so
it appears there are a large number of processes looking for 
memory pages.... 

Let's test this theory by decreasing the number of nfsd threads
to either 8 or 9 (basically cutting it in half). This will slow things
down a bit (assuming all those nfsd threads are actually being used),
but the system should not be so memory hungry, which should
stop the hang...
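
On Fedora that should be something like the following (a sketch; the thread count lives in /etc/sysconfig/nfs):

# in /etc/sysconfig/nfs
RPCNFSDCOUNT=8

# then restart the server threads, or change the count on the fly:
service nfs restart
rpc.nfsd 8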

Comment 42 Mark Hittinger 2009-11-04 17:11:04 UTC
I just dropped it back to 8.  I had raised it to 16 earlier to see if that
would make a difference.  I should see another one of these and will look
for the same memory issue.

Comment 43 Mark Hittinger 2009-11-09 19:49:10 UTC
Happened again with RPCNFSDCOUNT=8.  The /proc/sysrq-trigger trick shows
what appears to be the same memory issue.  In this case I wasn't able to get
to the client for several hours - about 10 hours.  When I got to the client it
had awoken from its coma and was fine.  It looks like it was in a coma for
about 7 hours.  I'm uneasy with a memory starvation explanation here and will
continue to fool around with it.  I never saw this under FC3.

Comment 44 Trevor Cordes 2009-11-18 10:04:00 UTC
I'm pretty sure we have 2 different bugs here: Rich's and mine (1) and Mark's (2).  Maybe it's best if Mark opens a new bug specific to his potential memory-related issue?

Steve, would it still be helpful to see a tshark/dmesg from my systems?  Do it on the client or server?  When my bug happens, I still have full control of both systems to login.

I just had the bug hit me right now (first time since my last report here), but I had to get it running again fast (close a zillion windows, kill a zillion ps's, umount -f, mount).  I can do these steps next time.

My client now has 8GB RAM with 4GB unused swap at nearly all times (including now).  My server has 3GB with 10G swap (almost always unused).  My server is a Pentium D (like Mark's, interestingly enough).

server:
2.6.27.25-170.2.72.fc10.i686 #1 SMP Sun Jun 21 19:03:24 EDT 2009 i686 i686 i386 GNU/Linux

client:
2.6.27.38-170.2.113.fc10.i686.PAE #1 SMP Wed Nov 4 17:43:53 EST 2009 i686 i686 i386 GNU/Linux

I rarely update/reboot the file server but I could do a new kernel if you think it would help.

Comment 45 Bug Zapper 2009-11-18 11:09:22 UTC
This message is a reminder that Fedora 10 is nearing its end of life.
Approximately 30 (thirty) days from now Fedora will stop maintaining
and issuing updates for Fedora 10.  It is Fedora's policy to close all
bug reports from releases that are no longer maintained.  At that time
this bug will be closed as WONTFIX if it remains open with a Fedora 
'version' of '10'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version prior to Fedora 10's end of life.

Bug Reporter: Thank you for reporting this issue and we are sorry that 
we may not be able to fix it before Fedora 10 is end of life.  If you 
would still like to see this bug fixed and are able to reproduce it 
against a later version of Fedora please change the 'version' of this 
bug to the applicable version.  If you are unable to change the version, 
please add a comment here and someone will do it for you.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events.  Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

The process we are following is described here: 
http://fedoraproject.org/wiki/BugZappers/HouseKeeping

Comment 46 Mark Hittinger 2009-11-24 19:47:45 UTC
I thought I should make one more post under this bug Trevor, sorry about that.

Recently I had another client hang and was able to recover the client without
having to restart the NFSD server side.

(kill the processes using /a /b and /c - all exported from the same server)
client# umount -l /a ; umount -f /a
client# umount -l /b ; umount -f /b
client# umount -l /c ; umount -f /c

(at this point attempts to remount /a /b /c will hang)

client# /etc/init.d/nfslock restart

(now the mounts will succeed)

client# mount /a ; mount /b ; mount /c

and the client is recovered.  So my problem (and I suspect the others' as well)
is lock-related.  I am now trying to discover whether the deadlock is client or
server side.

Comment 47 Bug Zapper 2009-12-18 07:58:01 UTC
Fedora 10 changed to end-of-life (EOL) status on 2009-12-17. Fedora 10 is 
no longer maintained, which means that it will not receive any further 
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of 
Fedora please feel free to reopen this bug against that version.

Thank you for reporting this bug and we are sorry it could not be fixed.

Comment 48 Steve Dickson 2010-01-18 14:55:10 UTC
Ok.. first I want to apologize for disappearing on this one... Deadlines 
and holidays seem to suck up my debugging cycles... 

Secondly, I moved this bug to F11 to hopefully eliminate some of 
that Bug Zapper traffic..
> Trevor Cordes Writes:
> Steve, would it still be helpful to see a tshark/dmesg from my systems?  Do it
> on the client or server?  When my bug happens, I still have full control of
> both systems to login.
Yes... having information like this would definitely help... 

Trevor, what do you mean by "My nfs4 server is 2.6.26.6-49.fc8" in
Comment 17? You are not truly using an FC8 nfsv4 server, are you??

Finally, has anybody seen this hang with a kernel like 2.6.30 or 
above?

Comment 50 Trevor Cordes 2010-02-27 11:51:01 UTC
I have recently upgraded to F12 and will soon do F13, on the client.  My server is now F10 (upgraded shortly after that note about my server being F8, around May 2009).  I have had lots of hangs with F10 client + F10 server.

I haven't had a hang since putting in F12 client, but it hasn't been that long yet.  I am in no rush to upgrade the server so it always lags behind.  (It has a 7TB md RAID array and upgrading kernels scares the willies out of me.)

I'll try to capture some sniffs next time it happens.  I will always report back when it does, so if you don't hear from me in 2-3 months then the bug may be fixed by F12's newer kernel.  But I won't hold my breath...

Comment 51 Anton Starikov 2010-03-30 23:49:58 UTC
I have diskless FC-12 clients (root over NFS3) and I think I see exactly the same bug with /home automounted over NFS4. (And I also have seen it with earlier kernels).
Also random hangs after some period of time.
I can attach my dmesg after sysrq-trigger

Comment 52 Anton Starikov 2010-03-30 23:53:03 UTC
Created attachment 403617 [details]
logfile after sysrq "t"

I attached my log.

Comment 53 Mark Hittinger 2010-04-21 22:02:47 UTC
Some ethernet NICs (broadcom, intel gigabit) have something called
ASF enabled.  This causes the NIC to gobble stuff sent to port 623
and port 664!

RPC mounts avoid these ports by using sunrpc.min_resvport which is
set to 665 these days.  So the code for mounts won't use ports below
665.

Out of our roughly 300 hosts we have 14 boxes that have the affected NIC's.
These hosts had a number of problems with YP timeouts until I put in
a kludge to use xinetd to grab port 623 and 664.  Now the various RPC services
cannot use those ports.

Having done that, I've not had a recurrence of our NFS hang in 4 months,
which is unusual - we would get it about once a week on average.

I'm thinking that some deadlock condition was being triggered by some
loss of lock/stat/rpc/etc packets due to these ASF enabled NICs.

BTW, hosts that did not have the affected NICs would hang, and the cursed NFS
server did not have one of the affected NICs.

My issue is resolved and I wanted to mention this glitch for anyone else
who might be running into something similar.
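
For reference, the reserved-port floor can be checked and pinned with sysctl (a sketch; this assumes the sunrpc module exposes the knob on your kernel):

sysctl sunrpc.min_resvport          # show the current floor (665 here)
sysctl -w sunrpc.min_resvport=665   # keep RPC clients off the ASF ports 623/664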

Comment 54 Bug Zapper 2010-04-27 13:01:11 UTC
This message is a reminder that Fedora 11 is nearing its end of life.
Approximately 30 (thirty) days from now Fedora will stop maintaining
and issuing updates for Fedora 11.  It is Fedora's policy to close all
bug reports from releases that are no longer maintained.  At that time
this bug will be closed as WONTFIX if it remains open with a Fedora 
'version' of '11'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version prior to Fedora 11's end of life.

Bug Reporter: Thank you for reporting this issue and we are sorry that 
we may not be able to fix it before Fedora 11 is end of life.  If you 
would still like to see this bug fixed and are able to reproduce it 
against a later version of Fedora please change the 'version' of this 
bug to the applicable version.  If you are unable to change the version, 
please add a comment here and someone will do it for you.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events.  Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

The process we are following is described here: 
http://fedoraproject.org/wiki/BugZappers/HouseKeeping

Comment 55 Trevor Cordes 2010-05-20 14:52:29 UTC
My setup doesn't use, and never has used, ports 623/664.  To get past my strict iptables, I've set up a very rigid port structure.  So that probably isn't my problem.

As an aside, which NICs are the ones you found problematic?

And, I haven't had this bug hit since my last report here (Nov 2009), so perhaps this has been fixed for me (still F12 client, F10 server).  I will keep reporting.

Comment 56 Trevor Cordes 2010-06-19 05:21:30 UTC
I had a new (somewhat different) NFS (server?) hang today, opening new bug #605884

I still have not seen this exact client bug since last report here.

Comment 57 Bug Zapper 2010-06-28 11:19:15 UTC
Fedora 11 changed to end-of-life (EOL) status on 2010-06-25. Fedora 11 is 
no longer maintained, which means that it will not receive any further 
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of 
Fedora please feel free to reopen this bug against that version.

Thank you for reporting this bug and we are sorry it could not be fixed.