Red Hat Bugzilla – Bug 133215
clients nfs mount goes stale after nfs service restart
Last modified: 2009-04-16 16:15:25 EDT
Description of problem:
I'm getting stale nfs file handles on nfs clients in certain
situations. Specifically, I see the error after a
manual service _restart_ or by stopping it for config changes and
restarting it after a while. However, if I manually remount from the
client, everything works fine again.
It only affects clients that are addressed via netgroups.
Version-Release number of selected component (if applicable):
Red Hat Enterprise Linux ES release 3 (Taroon Update 3)
How reproducible:
easy (very frequently)
Steps to Reproduce:
1. Start an NFS export service in clumanager with a netgroup as the client.
2. Mount the NFS export on a client.
3. Manually stop the NFS service.
4. Manually start the NFS service.
5. Try to access the export on the client.

Actual results:
Stale NFS file handles.

Expected results:
Access to the NFS mount.
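The steps above can be sketched from the shell; the service name "nfssvc", the mount paths, and the use of clusvcadm in place of the GUI cluster tool are all assumptions, not details from this report:

```shell
#!/bin/sh
# Reproduction sketch (hypothetical service name "nfssvc"; clusvcadm
# stands in for the GUI cluster tool used in this report).
SVC=nfssvc
if command -v clusvcadm >/dev/null 2>&1; then
  clusvcadm -d "$SVC"   # step 3: disable the clustered NFS service
  clusvcadm -e "$SVC"   # step 4: enable it again
  STATUS="service $SVC restarted"
else
  STATUS="clusvcadm not available; run this on a cluster member"
fi
echo "$STATUS"
# On the client, access then fails until a manual remount:
#   ls /mnt/export                   -> "Stale NFS file handle"
#   umount /mnt/export && mount server:/export /mnt/export
```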
1.2.16-1 is in the Cluster Suite channel on RHN (!)
Steps 3 and 4: Do you mean "service nfs stop / service nfs start", or
using the cluster tools to restart it?
> 1.2.16-1 is in the Cluster Suite channel on RHN (!)
I know, but I have a working system, and because of this error
I cannot restart the cluster! :-(
>Steps 3 and 4: Do you mean "service nfs stop / service nfs start",
>or using the cluster tools to restart it?
I'm using the cluster tool to manage the cluster-services.
The cluster tool gives the possibility to enable/disable and restart
services. So steps 3/4 should be:
3. disable nfs-service by cluster-tool
4. enable nfs-service by cluster-tool
or restart nfs-service by cluster-tool
I'm sorry for the misunderstanding.
Not really a misunderstanding; just want to have everything clear.
WRT to upgrading, you can do it in a 'rolling' fashion; details are in
the errata advisories. This should minimize downtime (because you
don't have to take the whole cluster offline to do it; just one node
at a time).
Few more questions:
(1) Is autofs (automount) used in conjunction with the clients? If
so, what is the mount timeout?
(2) How would you characterize the clients receiving ESTALE (e.g. all
netgroup members / some netgroup members / all clients [inside and
outside of the netgroup] / some clients [random, not specific to the
netgroup])?
(3) What are the entries in /var/lib/nfs/rmtab and
<service-mountpoint>/.clumanager/rmtab?
(4) When the service is running, there should be a copy of "clurmtabd"
running for each export path (but not per client). Is this the case?
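Question (4) can be checked quickly on the node running the service; counting the daemons with pgrep is an assumption, not a procedure from this report:

```shell
# Count running clurmtabd daemons; per the comment above there should
# be one per exported path, not one per client.
NPROC=$(pgrep -x clurmtabd 2>/dev/null | wc -l)
echo "clurmtabd processes: $NPROC (expect one per export path)"
```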
>WRT to upgrading, you can do it in a 'rolling' fashion; details are
>in the errata advisories. This should minimize downtime (because you
>don't have to take the whole cluster offline to do it; just one node
>at a time).
sure, but I want to solve the problem with the stale nfs handles
before I have downtime ;-)
1) Nope, there is no use of autofs.
2) All netgroup members - except those which are separately listed with
special mount/export options.
3) The clients are listed separately; /var/lib/nfs/rmtab contains the same
entries as <service-mountpoint>/.clumanager/rmtab.
Is that what you wanted to know?
4) That's the case!
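Answer (3) can be verified with a small helper; the service mount point in the example is a placeholder, not a real path from this report:

```shell
# rmtab_in_sync: succeed when the system rmtab and clumanager's
# per-service copy have identical contents.
rmtab_in_sync() {
  diff -q "$1" "$2" >/dev/null
}
# Example (paths are placeholders):
#   rmtab_in_sync /var/lib/nfs/rmtab /mnt/service/.clumanager/rmtab \
#     && echo "rmtab copies match" || echo "rmtab copies differ"
```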
I have just one more question:
Why isn't it possible to make changes to the exports and reload the
service? The only way to do it is to take the service down, make the
changes, and bring it online again. For a few additional export
entries, this is excessive.
Ok, so it's just the netgroup clients. That should make it easier.
The answer to your question lies in the way services are defined.
They're more or less monolithic with lots of properties as opposed to
modeled as a tree of separate entities combined in a group.
This is a known architectural limitation. It should go away in the
future (next major release of RHCS).
Hmmmmm... Did you remove any NFS clients from the service?
> Hmmmmm... Did you remove any NFS clients from the service?
What do you mean? No, I didn't remove any clients such that they could
no longer mount the exports ;-)
There's no need to make any changes to the service, if it goes down
and up, the netgroup-clients get stale nfs-handles.
I'm not sure how it behaves when relocating the service - I remember
that worked fine.
Thanks for the information. A similar problem occurs (apparently)
with wildcards, and yet another with many individual exports.
I have thus far been unable to reproduce any of the above.
Could you attach your cluster.xml (you can change your IPs/hostnames
if you're paranoid about it, but please don't change anything else)?
Created attachment 104357 [details]
Are you using the YP server to serve netgroups to the cluster?
More specifically, are netgroups from your clustered YP service being
used by your clustered NFS services?
Not really - it's just a YP slave for our network. But yes, the YP
server for the cluster is the slave.
Is there anything wrong?
It should be fine; we are just collecting as much data as we can so we
can try to figure out what's wrong. Thanks for your patience.
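One piece of data worth collecting on the YP side is whether the node itself resolves the netgroup; "nfsclients" is an assumed netgroup name, not one from this report:

```shell
# Resolve the (assumed) netgroup through YP; fall back gracefully when
# ypmatch is not installed or no YP domain is bound.
MEMBERS=$(ypmatch nfsclients netgroup 2>/dev/null) \
  || MEMBERS="(lookup failed or ypmatch unavailable)"
echo "netgroup nfsclients: $MEMBERS"
```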
Created attachment 104412 [details]
Should fix problem
Thanks for the patch.
I will include it in the new clumanager before updating and let you
know the results (this will take some time).
Thanks again for the patch -
it seems to be working correctly now.