Bug 486264
Summary: | nfs hangs periodically on only the Fedora client | ||
---|---|---|---|
Product: | [Fedora] Fedora | Reporter: | Rich Ercolani <rercola> |
Component: | nfs-utils | Assignee: | Steve Dickson <steved> |
Status: | CLOSED WONTFIX | QA Contact: | Fedora Extras Quality Assurance <extras-qa> |
Severity: | medium | Docs Contact: | |
Priority: | low | ||
Version: | 11 | CC: | ant.starikov, bugs, jeronimo, jlayton, kgroutsis, leo, prgarcial, rwheeler, steved, trevor |
Target Milestone: | --- | Keywords: | Reopened, Triaged |
Target Release: | --- | ||
Hardware: | All | ||
OS: | Linux | ||
Fixed In Version: | Doc Type: | Bug Fix | |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2010-06-28 11:19:15 UTC | Type: | --- |
Regression: | --- | Mount Type: | --- |
Bug Blocks: | 516998 | ||
Description
Rich Ercolani
2009-02-19 05:59:59 UTC
A few weeks ago I upgraded from F8 to F10. Everything has been pretty good except my always rock-solid NFS connection is now dying every 5-15 days as per the above. My client is F10. My NFS server is F8. They worked great for as long as I can remember when they were both F8. service nfs restart on the server does NOT bring my connection back. Neither does restarting the rpcd's I'm running on the server. I am using NFSv4.

192.168.100.2:/data /data nfs4 rw,bg,hard,intr,noatime,nosuid,proto=tcp,timeo=15,rsize=8192,wsize=8192 0 0

I noticed something very interesting: when the connection dies I get the normal "still trying" log, but only 1 of them! In the F8 days, if the server was not responding, I'd get hundreds of periodic "still trying" messages until the server was back up. On F10 it just says this ONCE and then it's like it gives up for good or is hung somehow.

Mar 5 07:56:03 pog kernel: nfs: server 192.168.100.2 not responding, still trying

There are no errors showing up on the server. The link between the two computers is good, as I have multiple ssh sessions open during this time which did not have any problems. Since my whole system can't live without /data, it's nearly impossible for me to unmount this NFS without rebooting.

Ah, I found a workaround to get the connection back up without rebooting:

umount -f /data
mount /data

The -f forces the NFS down even though there are open files. I tried in vain to find every process that was using /data and had killed 50% of my running processes, yet umount (without -f) would still complain about "device busy", and lsof /data just hangs. Not sure how many nasty side effects -f will have, but this works for me for now. The fact that the above works shows that the server is probably fine and the network is fine. Obviously this is not a real, long-term solution.

Is autofs in the picture at all?
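The forced-remount workaround above can be wrapped in a small helper. This is only a sketch of the reporter's steps, not an official fix; the function name and the DRYRUN guard (which prints the commands instead of running them) are my additions, and it must be run as root.

```shell
# run: execute a command, or just print it when DRYRUN=1 is set.
run() {
    if [ "${DRYRUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# force_remount: tear down a hung NFS mount and bring it back up
# using its existing fstab entry.
force_remount() {
    mountpoint=$1
    run umount -f "$mountpoint"   # -f forces the unmount despite open files
    run mount "$mountpoint"       # remount from /etc/fstab
}
```

As the reporter cautions, umount -f on a hard mount can have nasty side effects (unwritten data may be lost), so this is a recovery of last resort before a reboot.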
When the connection breaks, please run (as root):

rpcdebug -m rpc -s all

Then record what is in dmesg:

dmesg > /tmp/dmesg

I'm looking for any type of failures the client is having... Note: to turn off the debugging, do:

rpcdebug -m rpc -c all

I'll run that debug next time it fails. How can I tell if I'm running autofs? I don't seem to have any autofs rpm installed.

In contrast: I'm using autofs, but it's being used both on the Fedora client and some other machines that stay up the whole time.

Created attachment 335078 [details]
A round of output from rpcdebug during one such hang
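Steve's capture procedure can be bundled into one helper so it is easy to run while a hang is in progress. A sketch only: the soak interval, output path, and DRYRUN guard are assumptions, and it must be run as root.

```shell
# capture_rpc_debug: enable all RPC debug flags, snapshot dmesg,
# then disable debugging again. DRYRUN=1 prints the steps instead.
capture_rpc_debug() {
    out=${1:-/tmp/dmesg}
    if [ "${DRYRUN:-0}" = "1" ]; then
        echo "rpcdebug -m rpc -s all"
        echo "dmesg > $out"
        echo "rpcdebug -m rpc -c all"
        return 0
    fi
    rpcdebug -m rpc -s all      # turn on all RPC debug flags
    sleep 30                    # let the hung mount generate some log traffic
    dmesg > "$out"              # record what the kernel logged
    rpcdebug -m rpc -c all      # turn debugging back off
}
```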
PS: According to the NFS server, it's replying to thousands[!] of requests per second from this client, all with the same response code: NFS4ERR_STALE_CLIENTID. Any idea how the clientid would have become stale, and how this problem only ever affects this machine?

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=a71ee337b31271e701f689d544b6153b75609bc5 or http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=027b6ca02192f381a5a91237ba8a8cf625dc6f6a seem relevant.

Strange, but this bug has not hit me again since my last post on Mar 5. I will keep watching it and run the debug stuff next time.

This just hit me overnight. Was fine when I went to sleep. Woke up and NFS was hung on the client. I captured the debug output but I can't interpret it. Will attach. I don't see the stale thing. I'm going to try to recover now without rebooting... a tricky task since I rely heavily on that file server.

Created attachment 339810 [details]
dmesg output from nfs-hung client
Argh, I had to reboot to recover from it this time. umount -f wouldn't work (busy/open). I've thought about using soft mounting until this is resolved but the man page freaks me out when it talks about data loss using soft! My entire computer becomes useless with about half the apps hanging when my NFS link goes down.

Using the "intr" mount option might be an acceptable interim solution, if you're using nfs4 (which you appear to be).

I just checked and I already was using intr! That's bizarre because some of the apps that had /data as a cwd just wouldn't die. I tried closing everything but some stuff just hung around and I couldn't get it to umount. That happens nearly every time my nfs4 goes wonky. Like I said, everything depends on my 5TB file server. I am using nfs4.

Just to be clear... What kernel versions is this happening on (i.e. uname -r)? Also I am not aware that Sun has released Solaris 11... Do you mean OpenSolaris? If so which flavor?

When it died previously in late Feb I was using: 2.6.27.15-170.2.24.fc10.i686.PAE
When it died today, I was using: 2.6.27.19-170.2.35.fc10.i686.PAE
Right now I am using, since the reboot: 2.6.27.21-170.2.56.fc10.i686.PAE
We'll see if .21 dies... My nfs4 server is 2.6.26.6-49.fc8

Argh, just happened again, though a bit different this time. I got a bunch of not responding errors on the client this time (previously I didn't see any).
Apr 24 14:53:40 pog kernel: nfs: server 192.168.100.2 not responding, still trying
Apr 24 15:04:59 pog kernel: nfs: server 192.168.100.2 not responding, timed out
Apr 24 15:04:59 pog kernel: nfs: server 192.168.100.2 not responding, timed out
Apr 24 15:04:59 pog kernel: nfs: server 192.168.100.2 not responding, timed out
Apr 24 15:04:59 pog kernel: nfs: server 192.168.100.2 not responding, timed out
Apr 24 15:04:59 pog kernel: nfs: server 192.168.100.2 not responding, timed out
Apr 24 15:04:59 pog kernel: nfs: server 192.168.100.2 not responding, timed out
Apr 24 15:04:59 pog kernel: nfs: server 192.168.100.2 not responding, timed out
Apr 24 15:05:59 pog kernel: nfs: server 192.168.100.2 not responding, still trying
Apr 24 16:13:42 pog kernel: nfs: server 192.168.100.2 not responding, timed out
Apr 24 16:13:42 pog kernel: nfs: server 192.168.100.2 not responding, timed out
Apr 24 16:13:42 pog kernel: nfs: server 192.168.100.2 not responding, timed out
Apr 24 16:13:42 pog kernel: nfs: server 192.168.100.2 not responding, timed out
Apr 24 16:13:42 pog kernel: nfs: server 192.168.100.2 not responding, timed out
Apr 24 16:13:42 pog kernel: nfs: server 192.168.100.2 not responding, timed out
Apr 24 16:15:12 pog kernel: nfs: server 192.168.100.2 not responding, still trying
Apr 24 16:18:08 pog kernel: nfs: server 192.168.100.2 not responding, still trying
Apr 24 16:19:45 pog kernel: nfs: server 192.168.100.2 not responding, still trying
Apr 24 16:22:51 pog kernel: nfs: server 192.168.100.2 not responding, still trying
Apr 24 16:26:26 pog kernel: nfs: server 192.168.100.2 not responding, still trying

I didn't notice it was down until 16:10 or so, then I started doing the usual killing of programs to try to umount the NFS. The server looked OK but I did a service nfs restart there a few times anyhow. No recovery yet.
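A flood like the one above is easier to read in summary form. A small helper (just a convenience sketch, not part of the debugging Steve asked for) that tallies the "not responding" states from syslog lines fed on stdin:

```shell
# count_nfs_events: read syslog lines on stdin and count how many times
# each "not responding" state ("still trying" vs "timed out") appears.
count_nfs_events() {
    grep 'not responding' | sed 's/.*not responding, //' | sort | uniq -c
}

# Usage: count_nfs_events < /var/log/messages
```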
As before, I tried killing lots of progs and umount -f, but it's always the same:

# umount -f /data
umount2: Device or resource busy
umount.nfs4: /data: device is busy
umount2: Device or resource busy
umount.nfs4: /data: device is busy
Exit 16

lsof seems useless as it always hangs, or the -b option doesn't work in Linux and I can't find any dev ids in Linux for that option (the lsof man page has lots of info). I'm attaching the new dmesg, though this is from *after* I had already tried much of the remediation steps.

Prior to F8 I may have gotten the odd (like once a year) "not responding" error that was always either solved on its own or by a service nfs restart on the server. Only in F10 do I get these impossible client hangs requiring reboot. If there are any other ideas for workarounds to get the mount back up without a reboot, it would be most appreciated. Equally baffling is why umount -f isn't working, and why my hard,intr options don't seem to be working as they should!

There was a recent fix in the umount code.... What version of the nfs-utils are you using? Try nfs-utils-1.1.6.

nfs-utils-1.1.4-8.fc10.i386. Any idea when 1.1.6 will be pushed to F10 updates? I prefer to stay out of testing repos, but I suppose for this I could try it. I assume this will only help with the stuck umount issue and not the underlying NFS failure issue?

PS: since the last problem, I have changed my client mount point from in / to in /mnt, leaving a symlink from the old to the new location. Perhaps that will help make the system less dead when it goes down... or not. Certainly can't hurt.

We experience exactly the same problem. Our client is Debian Testing (_Squeeze_) x86 - a diskless node which uses nfsroot and boots from the server, also Debian Testing (_Squeeze_) x86. While the client hangs, the server is responding to everyone else's requests. Restarting the nfsd on the server doesn't solve the problem.
At first I wasn't able to capture debug information on the client side since /var/log was mounted over the NFS, so I have installed a hard drive where I mounted only /var/log, to be able to capture debug logs from the client as well.

Debug Logs:
http://fixity.net/tmp/client.log.gz - Kernel RPC Debug Log from the client
http://fixity.net/tmp/server.log.gz - Kernel RPC Debug Log from the server

How reproducible: Happens from 10 to 90 minutes after booting the diskless node.

Actual results: NFS connections stop responding, the system hangs or becomes very slow and unresponsive (it doesn't respond to Ctrl+Alt+Del either). 60 to 90 minutes after the first server timeout the client says the server is OK, but the client is still unresponsive. Immediately after that the client logs server connection loss again, which leads to a continuous loop. The client is still unresponsive. Sometimes the client resumes normal operation for a couple of hours, but then the problem repeats.

Connectivity info: Both the client and the server are connected to a Gigabit Ethernet Cisco Metro series managed switch. Both of them use Intel Pro 82545GM Gigabit Ethernet Server Controllers. Neither one of them logs any Ethernet errors and none are logged by the switch.

Client & Server Load: For the purposes of testing both machines were only running needed daemons and weren't loaded at all.

Client & Server Kernel: On both the client and server a custom compiled Linux 2.6.29.3 kernel was used.

Configuration file @ http://fixity.net/tmp/config-2.6.29.3.gz

Client & Server network interface fragmented packet queue length:
net.ipv4.ipfrag_high_thresh = 524288
net.ipv4.ipfrag_low_thresh = 393216

Client Versions:
libnfsidmap2/squeeze uptodate 0.21-2
nfs-common/squeeze uptodate 1:1.1.4-1

Client Mount (cat /proc/mounts | grep nfsroot):
10.11.11.1:/nfsroot / nfs rw,vers=3,rsize=524288,wsize=524288,namlen=255,hard,nointr,nolock,proto=tcp,timeo=7,retrans=10,sec=sys,addr=10.11.11.1 0 0

Client fstab:
proc /proc proc defaults 0 0
/dev/nfs / nfs defaults 1 1
none /tmp tmpfs defaults 0 0
none /var/run tmpfs defaults 0 0
none /var/lock tmpfs defaults 0 0
none /var/tmp tmpfs defaults 0 0

Client Daemons: portmap, rpc.statd, rpc.idmapd
Server Daemons: portmap, rpc.statd, rpc.idmapd, rpc.mountd --manage-gids

Server Versions:
libnfsidmap2/squeeze uptodate 0.21-2
nfs-common/squeeze uptodate 1:1.1.4-1
nfs-kernel-server/testing uptodate 1:1.1.4-1

Server Export:
/nfsroot 10.11.11.*(rw,no_root_squash,async,no_subtree_check)

Server Options:
RPCNFSDCOUNT=16
RPCNFSDPRIORITY=0
RPCMOUNTDOPTS=--manage-gids
NEED_SVCGSSD=no
RPCSVCGSSDOPTS=no

Additional Info: Since I have read that tweaking the nfsroot mount options could improve the situation, I have tested with different options as follows:
rsize/wsize=1024|2048|4096|8192|32768|524288
timeo=7|15|60|600
retrans=3|10|20
None resulted in solving the problem.

I have also tested with the following versions on the client and server end without any difference in behaviour:
libnfsidmap2/testing uptodate 0.21-2
nfs-common 1:1.1.6-1 newer than version in archive
nfs-kernel-server 1:1.1.6-1 newer than version in archive

I have been messing with this problem for the last couple of weeks and ran out of ideas.

Best Regards,
Jerome Walters

Good (bad?) to see it's not just Fedora. Looks like a full blown vanilla kernel bug then.
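Jerome's sweep over rsize/wsize and timeo combinations can be scripted rather than remounted by hand. A sketch only: the chosen values are a subset of his, the server/export/mountpoint arguments are illustrative, the DRYRUN guard is my addition, and each combination would still need a manual soak period to see whether it hangs.

```shell
# sweep_mount_opts: remount an NFS export once per rsize/timeo combination.
# DRYRUN=1 prints the mount commands instead of executing them. Run as root.
sweep_mount_opts() {
    server=$1 export_path=$2 mnt=$3
    for rsize in 8192 32768 524288; do
        for timeo in 7 60 600; do
            cmd="mount -t nfs -o hard,proto=tcp,rsize=$rsize,wsize=$rsize,timeo=$timeo $server:$export_path $mnt"
            if [ "${DRYRUN:-0}" = "1" ]; then
                echo "$cmd"
            else
                umount "$mnt" 2>/dev/null
                $cmd
                # ...exercise the mount here, then record whether it hung...
            fi
        done
    done
}
```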
Your symptoms don't exactly match mine, but I'm sure on a diskless node the NFS bug would be more severe and probably hit a lot sooner/more often. I should note that I have a similarity with you: an 82545GM NIC in both boxes. Rich, you too? Also, I'm running 9k jumbo packets on server+switch+client. You guys? I have also tweaked IP settings a bit:
net.core.rmem_max=8388608
net.core.wmem_max=8388608
net.core.rmem_default=262143
net.core.wmem_default=262143

I'm using an 82573L on the client.

Just had the bug hit again, grrr. My moving the mount point made no difference in how much it locks up my system. Because so many apps access that mount, my whole desktop gets very wonky when NFS goes down. My logs had some weird new stuff in them this time. Not sure why. Anyhow, perhaps it may be useful. Attaching (| grep -i nfs). This most recent crash happened with kernel 2.6.27.21-170.2.56.fc10.i686.PAE and had been fine since May 13. Now I am running 2.6.27.24-170.2.68.fc10.i686.PAE and hoping that will help.

Created attachment 345827 [details]
/v/l/messages | grep -i nfs output from latest crash
Oops, maybe grep nfs wasn't the way to go. Here's the entire bit from one specific instance in the log. Attaching. Hmm, very strange. Created attachment 345828 [details]
better output from a weird /v/l/messages entry
As I look into this more... very strange. My box has 4GB ECC RAM and 4GB swap. Most of the time almost no swap is used. I have no idea if something was running away before I rebooted... I didn't think to check RAM/swap usage. Besides the usual NFS-down app hanging issues, the system was behaving fine.

The last entry like this was May 26 20:45:48. Then:
May 28 00:20:43 pog kernel: nfs: server 192.168.100.2 not responding, still trying
Then, when I started using the system for the first time since the night of May 26 (maybe around 20:00):
May 28 11:58:55 pog kernel: nfs: server 192.168.100.2 not responding, timed out

I'm not sure if these issues are related, as all prior NFS issues did not have a page allocation failure that I know of. I will add this to my list of things to watch before rebooting next time.

I recently upgraded my file server to F10 (the client has been F10 for months). This bug just hit me again after a few days. Strange, but the bug hadn't hit for months (since my last post) when it was F10 client / F8 server. But that was probably a fluke. Both systems were fully updated and rebooted as of a few days ago. This bug is still a major problem.

The bug just hit me again. First time since the last report. I was doing some slightly unusual heavy recursive deletes (hundreds of thousands of files) as well as heavy reads/writes (backups) over NFS when it conked out. This time, when doing a reboot to recover, Linux hung on "shutting down... CIFS mounts" and I had to press reset. The backups were accessing CIFS and NFS mounts simultaneously, so the NFS hang must have hung the CIFS mounts in a bad way. Strange, but the kernel breezed through the unmounting-NFS part of shutdown immediately before hanging on the CIFS mounts.

I have been chasing this for a couple of months on Fedora 9. Clients will randomly hang on a specific Fedora 9 NFS server. Restarting NFS on that server causes the client to come unstuck.
Only one client at a time will hang - I have not seen two hang at the same time. These clients mount several Fedora 9 NFS servers. Only one specific NFS server is the culprit each time. Here are the unusual things about that NFS server:

1. It is on private IP space (172.30.0.12) and the clients are on a different subnet (10.110.131.0/24). I am using a static route to access the NFS server.
2. The NFS server serves 3 mount points (one is RO, two are a mix of RO/RW - RO to some clients, RW to others).

I have NFS servers on 172.30.0.0/24 that export only one file system to the 10.110.131.0/24 clients and they never hang on those. In summary, I think the problem is either triggered by multiple exports with the RO/RW mix or something screwy with using the static route to access a different subnet on the same LAN.

Hmm, #31. May be a different bug. So far I, and I think the other reporters, have had no luck reviving hung clients by restarting daemons on the *server*. In my own experience, doing server stuff doesn't seem to affect the hung client at all. I serve just 1 export (well, 2 because NFSv4 requires a bogus root one, see below) and my bug hits. So it would not seem to be dependent on the number of exports.

cat /etc/exports
/nfs 192.168.100.1(ro,async,no_subtree_check,no_root_squash,insecure,fsid=0)
/nfs/data 192.168.100.1(rw,async,no_subtree_check,no_root_squash,insecure,nohide)

It would be nice if you could upgrade the servers to F10 or 11 and see what happens. Maybe best to open a separate bug since the symptoms are different enough? But they may say you'll have to upgrade to 10 to get support. Have you tried killing all mount-using processes and umounting the NFS mount on the client when the bug hits? Use lsof to help kill them all. Lately I've been having success without a client reboot by killing a zillion processes (everything I do uses the mount, it seems) and remounting. As you say, it may be a different bug since restarting the server cures it.
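The kill-everything-then-remount dance described above can be condensed with fuser, which can kill every holder of a mount in one step. A sketch under the same caveats as the rest of this thread: fuser, like lsof, may itself block on a hard-hung NFS mount, the DRYRUN guard is my addition, and it must be run as root.

```shell
# free_mount: SIGKILL every process with files open under the mount,
# then force the mount down. DRYRUN=1 prints the steps instead.
free_mount() {
    mnt=$1
    if [ "${DRYRUN:-0}" = "1" ]; then
        echo "fuser -km $mnt"
        echo "umount -f $mnt"
        return 0
    fi
    fuser -km "$mnt"    # -m: processes using the mount; -k: kill them
    sleep 2             # give the kills a moment to land
    umount -f "$mnt"
}
```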
I haven't tried F11 yet - I did try a 2.6.31 kernel but got mount error 521's - apparently there are some NFS changes I need to research before going further with the latest kernel. When the client hangs it's stuck - you can log in on the console as root but you can never get to a shell, so we can't try umount -f and remount. The next time it happens I am going to try restarting lockd on the server instead of restarting nfsd and see if that brings the cure. It sure is acting like a lock issue.

What NFS version are you running? 3? Perhaps 3 shows the bug in a different way, and server killing fixes it. Are your important bin dirs (/usr for example) or /home or something else critical to bringing up a shell mounted via this buggy NFS? Weird that you can't log in while it's frozen. And when you kick the server, all of a sudden the client returns to normal?

Would it be possible to post a bzip2 tshark network trace? Something similar to:
tshark -w /tmp/data.pcap host <server> ; bzip2 /tmp/data.pcap

Yes, our setup is a couple hundred F9 clients mounting a dozen or so F9 NFS file servers. We are using NFS 3. I think we have /usr/local/sbin in the root path but it is not being served by the "cursed" file server. Yes, when we /etc/init.d/nfs restart on the "cursed" file server, whichever of the clients happened to have locked up will suddenly come alive. It hasn't happened yet but should happen within the next couple of days. Now, will a tshark capture started after the "hang" is noticed be of any value? I can't predict which one of the clients will "hang". I won't be able to run it on the client, but I would be able to run it on the file server after I notice a client has hung. Let me know if this is what you had in mind.

> Now will the tshark capture started after the "hang" is noticed be of any
> value?

Probably not... Since you can not predict which client will hang, we would definitely be in a 'needle in a haystack' scenario with all that traffic... Is the server running out of memory?
The theory being: when the server is restarted a bunch of memory is freed up, which allows things to proceed... How about this: when the server hangs, do a

echo t > /proc/sysrq-trigger

then

dmesg > /tmp/dmesg.log

and then post a bzip2 version of dmesg.log.bz2. This will hopefully show where the nfsd processes are hanging...

The "cursed" file server is a Pentium-D with 4GB RAM - no worries on running out of RAM. Will try the dmesg thing when it happens again. I have a 2.6.31.4 kernel with nfs-utils-1.2.0 ready to run on the "cursed" server, but it is not running yet. I will wait for one more of these to happen. The nfs-utils-1.2.0 resolves the mount error 521 issue I mentioned earlier.

Got one over the weekend. One hung client. Restarting NFSD on the "cursed" server clears all ills. Attached is the dmesg.log.bz2. Earlier I had increased the RPCNFSDCOUNT to 16. If this is a problem I can drop it back down. I will probably get another one of these hangs sometime within the next week.

Created attachment 367158 [details]
dmesg output on nfs file server after /proc rq trigger
Looking at the dmesg.log from Comment #40, it appears you are running low on memory... All 17 of your nfsd processes are hung in get_page_from_freelist(), plus a number of other processes are hung in handle_mm_fault(), so it appears there are a large number of processes looking for memory pages....

Let's test this theory by decreasing the number of nfsd threads to either 8 or 9 (basically in half). This will slow things down a bit (assuming all those nfsd threads are actually being used) but the system should not be so memory hungry, which should stop the hang...

I just dropped it back to 8. I had raised it to 16 earlier to see if that would make a difference. I should see another one of these and will look for the same memory issue.

Happened again with RPCNFSDCOUNT=8. The /proc/sysrq-trigger trick shows what appears to be the same memory issue. In this case I wasn't able to get to the client for several hours - about 10 hours. When I got to the client it had awoken from its coma and was fine. It looks like it was in a coma for about 7 hours. I'm uneasy with a memory starvation explanation here and will continue to fool around with it. I never saw this under FC3.

I'm pretty sure we have 2 different bugs here: Rich's and mine (1), and Mark's (2). Maybe best if Mark opens a new bug specific to his potential memory-related issue? Steve, would it still be helpful to see a tshark/dmesg from my systems? Do it on the client or server? When my bug happens, I still have full control of both systems to log in. I just had the bug hit me right now (first time since last report here), but I had to get it running again fast (close a zillion windows, kill a zillion processes, umount -f, mount); I can do these steps next time. My client now has 8GB RAM with 4GB unused swap at nearly all times (including now). My server has 3GB with 10G swap (almost always unused). My server is a Pentium D (like Mark's, interestingly enough).
server: 2.6.27.25-170.2.72.fc10.i686 #1 SMP Sun Jun 21 19:03:24 EDT 2009 i686 i686 i386 GNU/Linux
client: 2.6.27.38-170.2.113.fc10.i686.PAE #1 SMP Wed Nov 4 17:43:53 EST 2009 i686 i686 i386 GNU/Linux

I rarely update/reboot the file server but I could do a new kernel if you think it would help.

This message is a reminder that Fedora 10 is nearing its end of life. Approximately 30 (thirty) days from now Fedora will stop maintaining and issuing updates for Fedora 10. It is Fedora's policy to close all bug reports from releases that are no longer maintained. At that time this bug will be closed as WONTFIX if it remains open with a Fedora 'version' of '10'.

Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, simply change the 'version' to a later Fedora version prior to Fedora 10's end of life.

Bug Reporter: Thank you for reporting this issue and we are sorry that we may not be able to fix it before Fedora 10 is end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora please change the 'version' of this bug to the applicable version. If you are unable to change the version, please add a comment here and someone will do it for you.

Although we aim to fix as many bugs as possible during every release's lifetime, sometimes those efforts are overtaken by events. Often a more recent Fedora release includes newer upstream software that fixes bugs or makes them obsolete. The process we are following is described here: http://fedoraproject.org/wiki/BugZappers/HouseKeeping

I thought I should make one more post under this bug, Trevor, sorry about that. Recently I had another client hang and was able to recover the client without having to restart NFSD on the server side.
(kill the processes using /a, /b and /c - all exported from the same server)
client# umount -l /a ; umount -f /a
client# umount -l /b ; umount -f /b
client# umount -l /c ; umount -f /c
(at this point attempts to remount /a /b /c will hang)
client# /etc/init.d/nfslock restart
(now the mounts will succeed)
client# mount /a ; mount /b ; mount /c

and the client is recovered. So my problem (and I suspect the others' as well) is lock related. I am now trying to discover if the deadlock is client or server side.

Fedora 10 changed to end-of-life (EOL) status on 2009-12-17. Fedora 10 is no longer maintained, which means that it will not receive any further security or bug fix updates. As a result we are closing this bug. If you can reproduce this bug against a currently maintained version of Fedora please feel free to reopen this bug against that version. Thank you for reporting this bug and we are sorry it could not be fixed.

OK... first I want to apologize for disappearing; between them, deadlines and holidays seem to suck up my debugging cycles... Secondly, I moved this bug to F11 to hopefully eliminate some of that Bug Zapper traffic.

> Trevor Cordes Writes:
> Steve, would it still be helpful to see a tshark/dmesg from my systems? Do it
> on the client or server? When my bug happens, I still have full control of
> both systems to login.

Yes... having information like this would definitely help... Trevor, what does "My nfs4 server is 2.6.26.6-49.fc8" from Comment 17 mean? You are not truly using an FC8 nfsv4 server, are you?? Finally, has anybody seen this hang with a kernel like 2.6.30 and above?

I have recently upgraded to F12 and will soon do F13, on the client. My server is now F10 (upgraded shortly after that note about my server being F8, around May 2009). I have had lots of hangs with F10 client + F10 server. I haven't had a hang since putting in the F12 client, but it hasn't been that long yet. I am in no rush to upgrade the server so it always lags behind.
(It has a 7TB md RAID array and upgrading kernels scares the willies out of me.) I'll try to capture some sniffs next time it happens. I will always report back when it does, so if you don't hear from me in 2-3 months then the bug may be fixed by F12's newer kernel. But I won't hold my breath...

I have diskless FC-12 clients (root over NFS3) and I think I see exactly the same bug with /home automounted over NFS4. (And I have also seen it with earlier kernels.) Also random hangs after some period of time. I can attach my dmesg after sysrq-trigger.

Created attachment 403617 [details]
logfile after sysrq "t"
I attached my log.
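Mark's nfslock-restart recovery sequence from a few comments up generalizes to a loop over the affected mounts. A sketch only: the init-script path matches Fedora of that era, and the DRYRUN guard (print instead of execute) is my addition; run as root on the client after killing the processes using the mounts.

```shell
# recover_locked_mounts: lazy/force unmount each mount, restart the
# client-side lock daemon, then remount. DRYRUN=1 prints the steps.
recover_locked_mounts() {
    if [ "${DRYRUN:-0}" = "1" ]; then run() { echo "$*"; }; else run() { "$@"; }; fi
    for mnt in "$@"; do
        run umount -l "$mnt"              # lazy unmount detaches it now
        run umount -f "$mnt"              # then force whatever remains
    done
    run /etc/init.d/nfslock restart       # clears the wedged lockd state
    for mnt in "$@"; do
        run mount "$mnt"                  # remount from /etc/fstab
    done
}
```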
Some Ethernet NICs (Broadcom, Intel gigabit) have something called ASF enabled. This causes the NIC to gobble stuff sent to port 623 and port 664! RPC mounts avoid these ports by using sunrpc.min_resvport, which is set to 665 these days, so the mount code won't use ports below 665. Out of our roughly 300 hosts we have 14 boxes that have the affected NICs. These hosts had a number of problems with YP timeouts until I put in a kludge to use xinetd to grab ports 623 and 664. Now the various RPC services cannot use those ports. Having done that, I've not had a re-occurrence of our NFS hang in 4 months, which is unusual - we would get it about once a week on average. I'm thinking that some deadlock condition was being triggered by some loss of lock/stat/rpc/etc packets due to these ASF-enabled NICs. BTW, hosts that did not have the affected NICs would hang, and the "cursed" NFS server did not have one of the affected NICs. My issue is resolved and I wanted to mention this glitch for anyone else who might be running into something similar.

This message is a reminder that Fedora 11 is nearing its end of life. Approximately 30 (thirty) days from now Fedora will stop maintaining and issuing updates for Fedora 11. It is Fedora's policy to close all bug reports from releases that are no longer maintained. At that time this bug will be closed as WONTFIX if it remains open with a Fedora 'version' of '11'.

Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, simply change the 'version' to a later Fedora version prior to Fedora 11's end of life.

Bug Reporter: Thank you for reporting this issue and we are sorry that we may not be able to fix it before Fedora 11 is end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora please change the 'version' of this bug to the applicable version.
If you are unable to change the version, please add a comment here and someone will do it for you. Although we aim to fix as many bugs as possible during every release's lifetime, sometimes those efforts are overtaken by events. Often a more recent Fedora release includes newer upstream software that fixes bugs or makes them obsolete. The process we are following is described here: http://fedoraproject.org/wiki/BugZappers/HouseKeeping

My setup doesn't/never used ports 623/664. To get past my strict iptables, I've set up a very rigid port structure. So that probably isn't my problem. As an aside, which NICs are the ones you found problematic? And, I haven't had this bug hit since my last report here (Nov 2009), so perhaps this has been fixed for me (still F12 client, F10 server). I will keep reporting.

I had a new (somewhat different) NFS (server?) hang today, opening new bug #605884. I still have not seen this exact client bug since my last report here.

Fedora 11 changed to end-of-life (EOL) status on 2010-06-25. Fedora 11 is no longer maintained, which means that it will not receive any further security or bug fix updates. As a result we are closing this bug. If you can reproduce this bug against a currently maintained version of Fedora please feel free to reopen this bug against that version. Thank you for reporting this bug and we are sorry it could not be fixed.
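A footnote on Mark's ASF observation in the thread above: whether a given RPC source port would collide with the two ASF ports is trivial to screen for. A minimal sketch; the helper name is mine, and /proc/sys/sunrpc/min_resvport is where 2.6-era kernels expose the RPC reserved-port floor he refers to.

```shell
# is_asf_port: succeed if a source port would collide with ASF's 623/664,
# which an ASF-enabled NIC can silently consume before the host sees it.
is_asf_port() {
    case "$1" in
        623|664) return 0 ;;
        *)       return 1 ;;
    esac
}

# On a live system, confirm the RPC floor keeps client ports above 664:
#   cat /proc/sys/sunrpc/min_resvport    # 665 avoids both ASF ports
```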