Bug 369991

Summary: tcp nfs client io can deadlock for 1 - 15 minutes during service relocation
Product: Red Hat Enterprise Linux 5
Component: kernel
Version: 5.0
Hardware: All
OS: Linux
Status: CLOSED WONTFIX
Severity: low
Priority: low
Reporter: Corey Marthaler <cmarthal>
Assignee: Lon Hohberger <lhh>
QA Contact: Cluster QE <mspqa-list>
CC: jeder, jlayton, jruemker, k.georgiou, lhh, nhorman, stanislav.polasek, staubach, steved
Keywords: Reopened
Doc Type: Bug Fix
Last Closed: 2009-04-01 18:10:42 UTC
Attachments:
  first part of binary capture of ACK storm starting
  failover 2

Description Corey Marthaler 2007-11-07 17:43:53 UTC
Description of problem:
Lon and I were noticing that nfs client I/O (when mounted w/ tcp) can deadlock
for 1 to 15 minutes after a service relocation has taken place. When mounted
with udp, the I/O pause is never more than 10 seconds or so.

In include/linux/nfs_fs.h:

#define NFS_MAX_UDP_TIMEOUT     (60*HZ)
#define NFS_MAX_TCP_TIMEOUT     (600*HZ)
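
For comparison, the client-side retry behaviour can also be tuned per mount
with the timeo/retrans options.  This is only an illustrative sketch; the
values, paths, and addresses below are assumptions, not settings tested in
this bug:

  # hypothetical client mounts; timeo is in tenths of a second
  mount -o vers=3,proto=tcp,timeo=100,retrans=3 10.15.89.209:/mnt/ext1 /mnt/nfs1
  mount -o vers=3,proto=udp,timeo=11,retrans=3  10.15.89.209:/mnt/ext1 /mnt/nfs1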

This is what the config looks like:
[root@hayes-01 ~]# clustat
Member Status: Quorate

  Member Name                        ID   Status
  ------ ----                        ---- ------
  hayes-01                              1 Online, Local, rgmanager
  hayes-02                              2 Online, rgmanager
  hayes-03                              3 Online, rgmanager

  Service Name         Owner (Last)                   State
  ------- ----         ----- ------                   -----
  service:EXT1         hayes-02                       started
  service:EXT2         hayes-01                       started

  <rm>
   <failoverdomains>
                      <failoverdomain name="ALL" ordered="0" restricted="0">
                                <failoverdomainnode name="hayes-02" priority="1"/>
                                <failoverdomainnode name="hayes-03" priority="1"/>
                                <failoverdomainnode name="hayes-01" priority="1"/>
                        </failoverdomain>
   </failoverdomains>
   <resources>
      <ip address="10.15.89.209" monitor_link="1"/>
      <ip address="10.15.89.210" monitor_link="1"/>
      <fs device="/dev/hayes/aoe1" force_unmount="1" fstype="ext3"
mountpoint="/mnt/ext1" name="ext1" options=""/>
      <fs device="/dev/hayes/aoe2" force_unmount="1" fstype="ext3"
mountpoint="/mnt/ext2" name="ext2" options=""/>
      <nfsexport name="nfs exports"/>
      <nfsclient name="all" options="rw" target="*"/>
   </resources>
   <service autostart="1" domain="ALL" name="EXT1">
            <fs ref="ext1">
                    <nfsexport ref="nfs exports">
                             <nfsclient ref="all"/>
                    </nfsexport>
            </fs>
            <ip ref="10.15.89.209"/>
   </service>
   <service autostart="1" domain="ALL" name="EXT2">
            <fs ref="ext2">
                    <nfsexport ref="nfs exports">
                             <nfsclient ref="all"/>
                    </nfsexport>
            </fs>
            <ip ref="10.15.89.210"/>
   </service>
  </rm>


Version-Release number of selected component (if applicable):
2.6.18-53.el5
rgmanager-2.0.31-1.el5

Comment 1 Lon Hohberger 2007-11-09 20:52:47 UTC
This is a kernel bug.  The TCP NFS client in RHEL5 is supposed to retry.  Maybe
it didn't flush the ARP cache or something and was trying to send to the old
hardware address?
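
If a stale ARP entry were the culprit, one quick check from the client would be
to look at (and clear) the neighbor entry for the floating service address.
This is only a hypothetical diagnostic sketch, not something that was run here:

  # show which hardware address the client currently has for the service IP
  ip neigh show 10.15.89.209

  # clear it so the next packet forces a fresh ARP resolution
  arp -d 10.15.89.209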

Comment 2 Lon Hohberger 2007-11-14 18:44:04 UTC
This doesn't look specific to rgmanager, and it's on the *client* side anyway.
Basically, sometimes recovery (i.e., I/O resuming on the client) happens
quickly (a few seconds), while at other times it takes 10-15 minutes to resume.

Note that this problem seems to be specific to NFS over TCP; NFS over UDP
routinely recovers quickly.

Comment 3 Lon Hohberger 2007-12-20 19:18:21 UTC
In linux/include/linux/nfs_fs.h:

#define NFS_MAX_TCP_TIMEOUT     (600*HZ)

I wonder if this is what's causing our problem...

Comment 4 Nate Straz 2008-01-22 18:12:20 UTC
In recent 5.1.Z testing I've seen hangs that last for days with tcp mounts
(which are the default).

Comment 5 Jeff Layton 2008-05-20 19:37:12 UTC
Rescuing this from the fedora kernel queue...

Yes, we do crank up the RPC timeouts on TCP since it's desirable to allow the
TCP layer to handle retransmissions rather than having the RPC layer do it.

This really sounds like something around the TCP or IP layer -- it might be a
problem with ARP timeouts, but if Nate's comment #4 is correct, then maybe
there's something else going on.

Some network captures from the client when this is happening might be
interesting -- maybe see if we can see whether packets are leaving the client
box, and where they're being sent.

Comment 6 Peter Grandi 2008-05-26 15:52:42 UTC
The fundamental problem is that a lot of people misunderstand what it means for TCP to be 
reliable: it is a reliable data protocol, but not a reliable session protocol.

That is, if a data packet is lost, the TCP stacks will retransmit; but if the session is lost, the 
TCP stacks will not attempt to reconnect. The NFS client for TCP is written assuming that TCP is 
fully reliable, and will not attempt to reconnect if the session dies, but that for UDP assumes 
that UDP is fully unreliable, and will retry.

Comment 7 Jeff Layton 2008-05-27 11:25:56 UTC
All true, but hanging for tens of minutes is decidedly non-optimal. I have to
wonder if there might be something we can do to help the client speed along
reconnecting in this situation.

The first thing we need to do is verify what the client is actually doing during
these 10-15 minutes. The TCP layer will be resending frames, but does the RPC
layer ever get an error back? Also, when we fail over, I'm pretty sure the
cluster will send gratuitous ARPs, why is the client not picking these up and
sending the retransmissions to the other server? The other server obviously
won't know anything about the connection, so it should send back a RST. Why
isn't the client reconnecting when this occurs?

All in all, I think we need to sit down and look at what's actually happening on
the wire and in the client to see why this isn't failing over as quickly as we'd
like. When I get some time, I'll plan to ping Lon and see if I can get access to
his cluster so we can do so.


Comment 8 Peter Staubach 2008-05-27 12:11:02 UTC
A 10 minute timeout seems a little excessive.  If the server hasn't
responded in a minute, then something is going on and the users need
to be informed.

Comment 9 Jeff Layton 2008-05-28 16:38:40 UTC
Chatted with Corey over IRC and he mentioned that he could probably get me
access to a cluster where we can test this. I'll set this to NEEDINFO for him
for now...

Here's what I think should be happening in this situation:

1) client is in the middle of some I/O, NFS service fails over to other cluster
member.

2) server sends gratuitous ARP.

3) client (or a router if they're on different subnets) picks this up and starts
sending packets to "new" server

4) "new" server doesn't know anything about this connection and sends back an RST

5) client reconnects the socket, and starts resending data

...obviously something is falling down and this isn't happening. If everything
on the wire is happening like we think it should and the client is getting back
the RST, then we probably need to ensure that that returns some sort of error
back to the RPC layer so the connection can be re-established and the RPC call
reissued.

I think we'll want to recreate this problem while sniffing traffic on the client
(and maybe on the servers), and also while capturing nfs and rpc debug info. We
should also try to come up with a reproducer that involves minimal actual I/O.
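
A rough sketch of that kind of client-side data collection (the interface name
and paths are assumptions; rpcdebug ships with nfs-utils):

  # capture traffic to/from the floating service address on the client
  tcpdump -i eth0 -s 0 -w /tmp/nfs-failover.pcap host 10.15.89.209 &

  # enable RPC and NFS client debugging for the duration of the test
  rpcdebug -m rpc -s all
  rpcdebug -m nfs -s all

  # ...reproduce the failover/hang, then turn the debugging back off
  rpcdebug -m rpc -c all
  rpcdebug -m nfs -c all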


Comment 10 Jeff Layton 2008-05-29 15:43:05 UTC
Looking over this today -- this apparently "normally" works. The hang doesn't
happen every time. It looks like there may be more than one problem here -- we
may need more BZ's:

1) the xiogen/xdoio test seems to occasionally detect data corruption during
failover. We need to let it run for a bit w/o failing over and see if we can
detect the same sort of corruption w/o a failover. This is probably unrelated to
the original problem, but may be a regression from 5.1-ish kernels.

2) sometimes when processes are exiting, they don't seem to be cleaning up their
locks on the server. I still see them in /proc/locks on the server. I don't
think this is what is supposed to happen...

3) I think I've reproduced a hang during failover. What it looks like is that
the client and server are stuck in some sort of ACK positive feedback loop:

1212077402.302324  10.15.89.40 -> 10.15.89.209 TCP [TCP Dup ACK 2#20] 891 > nfs
[ACK] Seq=1681315 Ack=373611122 Win=48971 Len=0 TSV=2920515 TSER=773453651
1212077402.302712 10.15.89.209 -> 10.15.89.40  TCP nfs > 891 [ACK] Seq=0 Ack=0
Win=501 Len=0 TSV=774020121 TSER=2262704
1212077402.303012  10.15.89.40 -> 10.15.89.209 TCP [TCP Dup ACK 2#21] 891 > nfs
[ACK] Seq=1681315 Ack=373611122 Win=48971 Len=0 TSV=2920515 TSER=773453651
1212077402.303314 10.15.89.209 -> 10.15.89.40  TCP nfs > 891 [ACK] Seq=0 Ack=0
Win=501 Len=0 TSV=774020121 TSER=2262704

...the client and server seem to be sending ACKs back and forth as fast as
possible. I cranked up rpc_debug after this started. Just this:

-pid- proc flgs status -client- -prog- --rqstp- -timeout -rpcwait -action- ---ops--
03888 0018 04a0 000000 df525600 100003 d1592000 00060000 xprt_pending e036c580
e0378608

...just 1 RPC call in queue, and no other debug output, so I think it was in the
bowels of the TCP code at that time. Eventually this resolved itself when it hit
a major TCP timeout...debug output from that (which looks normal):

RPC: 3888 running timer
RPC: 3888 xprt_timer
RPC: 3888 __rpc_wake_up_task (now 3377991)
RPC: 3888 disabling timer
RPC: 3888 removed from queue d15570a8 "xprt_pending"
RPC:      __rpc_wake_up_task done
RPC: 3888 sync task resuming
RPC: 3888 call_status (status -110)
RPC: 3888 call_timeout (major)
RPC: 3888 call_bind (status 0)
RPC: 3888 call_connect xprt d1557000 is connected
RPC: 3888 call_transmit (status 0)
RPC: 3888 xprt_prepare_transmit
RPC: 3888 call_encode (status 0)
RPC: 3888 marshaling UNIX cred d3492c40
RPC: 3888 using AUTH_UNIX cred d3492c40 to wrap rpc data
RPC: 3888 xprt_transmit(112)
RPC:      xs_tcp_send_request(112) = 112

...during this time rpciod on the client appeared to be idle:

 =======================
rpciod/0      S 0000026F  2684  2394      7                1208 (L-TLB)
       d157cf88 00000046 5f2c3e9f 0000026f 00000003 d7529160 c14a1c00 0000000a 
       df2c5000 daf47aa0 5f2c57f8 0000026f 00001959 00000000 df2c510c c13f4ee0 
       d34481f0 d34481f4 00000000 00000082 c04331fe d157cfa4 d7529154 00000246 
Call Trace:
 [<c04331fe>] run_workqueue+0x98/0xb5
 [<c0433a6b>] worker_thread+0xb2/0x10b
 [<c042027b>] default_wake_function+0x0/0xc
 [<c04339b9>] worker_thread+0x0/0x10b
 [<c0435ea1>] kthread+0xc0/0xeb
 [<c0435de1>] kthread+0x0/0xeb
 [<c0405c3b>] kernel_thread_helper+0x7/0x10

...and the df command I was running and had hung looked like this:

df            S 00000340  2176  2609   2302                     (NOTLB)
       cff76cf0 00000082 82bacc2c 00000340 d34480e8 00000246 c077d783 00000007 
       dabf3aa0 da85a550 848a2c99 00000340 01cf606d 00000000 dabf3bac c13f4ee0 
       cff76ce0 d1557050 00000000 d15571c4 00000026 cff76d24 c13f0724 00000001 
Call Trace:
 [<e036fa40>] rpc_wait_bit_interruptible+0x1d/0x22 [sunrpc]
 [<c0608643>] __wait_on_bit+0x33/0x58
 [<e036fa23>] rpc_wait_bit_interruptible+0x0/0x22 [sunrpc]
 [<e036fa23>] rpc_wait_bit_interruptible+0x0/0x22 [sunrpc]
 [<c06086ca>] out_of_line_wait_on_bit+0x62/0x6a
 [<c0435f90>] wake_bit_function+0x0/0x3c
 [<e036ff3d>] __rpc_execute+0xde/0x1f8 [sunrpc]
 [<e036b94b>] rpc_call_sync+0x6b/0x91 [sunrpc]
 [<e03f37b9>] nfs3_rpc_wrapper+0x1f/0x5b [nfs]
 [<e03f4da9>] nfs3_proc_statfs+0x54/0x75 [nfs]
...

...the nfsd's on the server seemed to be idle.

I think this might be a problem in the lower-level networking code. We shouldn't
be getting ping-pong ACK storms like this, I don't think.

I'll have to see if I can come up with a way to semi-reliably reproduce this.


Comment 11 Jeff Layton 2008-05-29 16:29:17 UTC
Problem #2 above might be related to this bug:

https://bugzilla.redhat.com/show_bug.cgi?id=448929

We'll have to test Peter's patch and see...

For now, I'm going to use this case to pursue the hangs. I was able to reproduce
an ACK storm during failover. I'm not sure of the exact cause, but I think it
might be related to the ACK storm description in this microsoft (!) technet article:

http://technet.microsoft.com/en-us/magazine/cc160809.aspx

My theory is that somehow when we time a failover just right, we end up in an
ACK storm like this.

Comment 12 Jeff Layton 2008-05-29 16:31:19 UTC
Problem 1 may be related to this bug of Peter's. At the very least, testing a
kernel with his patch may be worthwhile to see if it prevents the problem.

https://bugzilla.redhat.com/show_bug.cgi?id=435291


Comment 13 Jeff Layton 2008-05-29 17:02:14 UTC
Created attachment 307111 [details]
first part of binary capture of ACK storm starting

This is a binary capture I caught while reproducing this. One thing I've
noticed is that when the service fails over, it does not close the sockets
between it and any clients. So when you fail back, the server thinks there's
still an active socket and will accept things like ACKs on it. That may be
related to this problem, though I can't confirm whether having an active socket
is actually necessary for this.

In the capture, we see the server get an FSINFO RPC call, and then the client
and server start hammering each other with the same resync ACK packets.

There does seem to be a workaround for this problem: when I fail over again, it
seems to become unstuck. Of course, that's not ideal...

Comment 14 Neil Horman 2008-05-29 18:00:14 UTC
Jeff and I were talking about this, and based on the above trace it seems this
is a variant of a TCP sequence prediction attack.  Jeff indicated to me that he
recreated the error by repeatedly failing the service over between the two
servers.  Frames 10 and 11 in the trace above show the gratuitous ARPs which
repoint the client to the 'new' server, and frames 12 onward show that the
client picked them up successfully (it uses the new MAC address).  That's when
the ACK storm starts.

Our theory is that neither server in the cluster closes its open sockets when
the service is migrated.  As a result, when the service is failed back and
forth repeatedly, the client reuses the same connection again and again on
each server.  We are theorizing that, since the client is reusing these
connections, some amount of data gets ACKed by the 'old' server right before
we fail over to the 'new' server.  When the client then picks up where it left
off with the new server from the last failover, that server ACKs the last
sequence number it saw, while the client is a large number of bytes ahead of
that.  The client has already discarded that data, since it was acked by the
'old' server, so it cannot be recovered.  We're getting a larger dump to
confirm this now.

Comment 15 Jeff Layton 2008-05-29 18:24:41 UTC
Created attachment 307118 [details]
failover 2

More complete (and smaller) binary capture of the problem.

Comment 16 Neil Horman 2008-05-29 18:41:22 UTC
Ok, I can confirm that we're seeing what is effectively a sequence prediction
problem here, which arises from the fact that in a failover where the 'failed'
server stays up, it doesn't close the TCP sockets that were opened for the
migrating service.  The result is that, across multiple service failovers, an
NFS client using TCP will 'reuse' TCP connections on the server, and it
becomes possible to lose data in the TCP stream that was acked by the departing
server right before service migration.  The packet capture in comment #15
illustrates this:

General notes:
10.15.89.40 is the NFS client
10.15.89.209 is the server
00:1d:09:03:c8:f0 is the MAC address for real server 1 on the cluster (smac1)
00:1d:09:01:cc:2d is the MAC address for real server 2 on the cluster (smac2)

Frames:

Frame 9: Client establishes initial connection to server at smac1

Frame 21: Server at smac1 ACKs client sequence up to seq #0xedbb53fa. This is
the last frame sent from this connection before failover to server at smac2.

Frame 29: Gratuitous arp sent from cluster to repoint clients to server at smac2

Frame 32: In response to a subsequent TCP frame from the client, the server
sends an RST frame.  This is good.

Frame 37: NFS layer, detecting reset connection, re-establishes connection to
server at smac2

Frame 43: Server at smac2 sends ACK frame acking client sequence up to seq #
0xedbb6bdd

Frame 50: Gratuitous arp sent from cluster to repoint clients to server at smac1

Frame 60: Client starts re-using connection with server at smac1 (established in
frame 9).  Sends a TCP frame to server at smac1 with sequence number 0xedbb6bdd.

Frame 61: Server at smac1 responds with an ack up to seq number 0xedbb53fa
(which was the last sequence number it received and acked in frame 21).

The result is that the data the client sent between frames 21 and 61 was acked
by the server at smac2 and has since been discarded, so it is unavailable at
the client.  The client sees that the 'server' has effectively reneged and is
claiming to be missing bytes in the stream that it has already acked, while
the server at smac1 sees only that it has a gap in its stream that can't be
filled.  The result is an infinite retry loop between the client and the
server.  Backwards sequence numbers are silently ignored by the client (which
IIRC is per the RFC), while all the server at smac1 can do is ACK its last
known sequence number, hoping to fill in the missing data.

Comment 17 Jeff Layton 2008-05-29 19:40:04 UTC
Thanks Neil...

Lon seemed to think that he was able to reproduce hangs on a "first time"
failover. If so, then that's probably a different problem. I suggest that we
concentrate on fixing this for now and then see whether this might be
reproducible afterward...

As far as fixes go, some changes went upstream a while back to add an interface
whereby lockd would drop all locks associated with a particular cluster IP
address. We could possibly hook into that interface and have it also close any
nfsd connections for the same IP, or maybe add a new /proc file that does
something similar. We'll have to look over those patches and see if they're
suitable for RHEL though.
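
For reference, a hedged sketch of how a failover script might poke that
upstream interface (file names as in later mainline kernels, and not
necessarily present in RHEL5 kernels):

  # drop all NLM locks that were acquired via the floating service address
  echo 10.15.89.209 > /proc/fs/nfsd/unlock_ip

  # or drop locks per exported filesystem
  echo /mnt/ext1 > /proc/fs/nfsd/unlock_filesystem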

Another possibility might be to have nfsd close all of its non-listening sockets
when it gets a particular signal. That would mean that connections to nfsd that
are not related to the service moving would be closed, but in theory the server
would just send back a RST on the next packet and the connection would be
re-established. Hiccups should be fairly minimal, and would only happen on
failover...


Comment 18 Neil Horman 2008-05-30 13:00:29 UTC
No problem.

If Lon thinks that he can reproduce this on an initial failover (or more
specifically a failover in which the server newly hosting the service doesn't
already have a TCP connection established with the client) then we'll need to
see evidence of that, since that will definitely be a different and unique
problem.  The trace above (reproducing with multiple successive failovers) is
rather conclusive in its root cause.

As far as the fix goes, your idea sounds reasonable to me.  About the only
thing I would worry about is scope.  Your proposed fix sounds like it will
work rather well for nfs, but the problem as described above is applicable to
any clustered service that uses TCP.  Any case in which a service migration
between live servers doesn't close sockets on the 'departing' server has
exposure to this bug.  I'm not sure what the best way to handle this is (a
case-by-case basis may be best).



Comment 19 Jeff Layton 2008-05-30 13:15:03 UTC
NFS is a bit of a special case in RHCS. Most userspace services do get shut down
on failover. We don't for NFS since nfsd is somewhat of a "shared resource" and
it's possible (though not common) for someone to have NFS shares that are not
clustered mixed with those that are.

The failover case here is also a bit uncommon -- most actual users of RHCS
wouldn't be failing back and forth this quickly. We *do* need to fix this though
since QA does do this when testing and they're an important "customer".

You're quite correct though that this could be a problem for other TCP-based
services if sockets are not closed on failover. We might also consider writing
something about this for the CS docs so that someone setting up a new clustered
service doesn't get bitten by the same thing.


Comment 20 Lon Hohberger 2008-05-30 13:48:54 UTC
re: Comment #17 - I could be wrong about that, and for now, we'll go with the
assumption that I _am_ wrong.  If for some reason, I turn out to be correct,
we'll address that later.

re: comment #19 - It's not so much mixed clustered+local NFS that I'm worried
about, it's more "two NFS services" which are sharing a node for some reason. 
It also definitely sounds like a case which could potentially be hit on other
services, but I can't come up with any cases.

Generally, services require separate instances of daemons specific to that
service, and we don't recommend/test running a single httpd listening on
INADDR_ANY.  Rather, we generally tell people to have per-service instances of
daemons where possible - listening on specific IP addresses - likely for this
very reason.

Is there a way to make a kernel nfsd thread that's bound to a specific set of IP
addresses?

Comment 21 Lon Hohberger 2008-05-30 13:52:04 UTC
Whoops, refactor P3 so it's readable:

We don't recommend/test (for example) running a single httpd listening on
INADDR_ANY.  Rather, we tell people to have per-service instances of various
daemons listening on service-assigned IP addresses.

This generally avoids the problem we encountered with nfsd.


Comment 22 Jeff Layton 2008-06-01 20:29:02 UTC
> Is there a way to make a kernel nfsd thread that's bound to a specific set of IP
> addresses?

Unfortunately, no...

Comment 23 Jeff Layton 2008-06-10 19:54:16 UTC
Posted a question about this upstream with a rough outline of a proposal to fix
it. It generated quite a lot of interest! Some high points:

1) we should try to make a single interface that we call into for failover. The
"unlock_ip" stuff that went upstream recently can be used to unlock just the
locks that came in via a particular IP address. That should be extended (and
maybe renamed) to close sockets as well. We may need it to do other things.

2) simply closing the socket probably won't be sufficient. A possible scenario:

We get a non-idempotent request (say, a rename or unlink). The server gets the
call, and does the change on the local filesystem. But, before we can send the
reply back, the sockets are closed, and the local address is torn down so no
reply is sent.

The other server takes over the filesystem and IP address, etc. The client
retransmits the request after reestablishing the connection, but now the request
fails. The file has already been renamed or unlinked, etc...

Truthfully, we may already be vulnerable to this problem. It's not clear to me
that we can ever eliminate this race 100%. It's possible for the server to crash
or lose network connectivity after getting and processing a request but before
sending the reply.


Comment 24 Jeff Layton 2008-06-10 20:05:34 UTC
I think the simplest thing to do here is to just stop nfsd prior to taking down
the IP address, and then to start it up again after the address is gone. That
should make sure that all requests are flushed out.

Clients should transparently recover if they happen to hit the server after nfsd
has gone down, but before the address has been removed.

It's not an ideal solution, but it's entirely doable from userspace. I don't see
much in the way of advantages to trying to do something special in kernel for
this. Perhaps if we want to take advantage of unlock_ip then we'll want to think
about this differently.

So to roughly hash this out, on failover:

On "old" server:
pkill -2 nfsd
remove ip address from interface
start up nfsd again
pkill -9 lockd

...the "new" server should behave more or less as it does today.


Comment 26 RHEL Program Management 2008-06-19 14:20:53 UTC
Development Management has reviewed and declined this request.  You may appeal
this decision by reopening this request. 

Comment 27 Jeff Layton 2008-09-13 19:08:38 UTC
Reassigning this back to Lon. I think we understand at least this problem now, but I don't see upstream getting any sort of interface to close open sockets here. I think the best thing to do is something like I explained in comment #24. Alternatively, we could consider a blurb in the docs about this problem and just tell people not to fail over and back rapidly to avoid it...

Comment 28 Lon Hohberger 2008-12-09 21:38:48 UTC
There are two cases:

(a) 2+ NFS services are running on a cluster node.  Stopping one (incl. killing nfsd) will affect the other - clients of the *other* service will suddenly get ESTALE.  To prevent ESTALE on clients of the *other* NFS service, we will need to:

 * stop all NFS services
 * kill lockd / nfsd / etc.
 * start lockd / nfsd / etc.
 * start all but the NFS service we wanted to stop

This solution is suboptimal at best (clients lose access temporarily), and destabilizing at worst (this will require a pretty decent design to fix).  I might have a way to fix it, but it's not well tested and not in RHEL5.

One roundabout way to solve this is to present LUNs to virtual machines and store the IP address, NFS export list, and mount points all within the virtual machine and export the file system(s) from that virtual machine instead of a cluster service.  Stopping/starting virtual machines should not suffer any of these interaction problems.


Now, in the case I'm wrong about ESTALE (clients will just retry), or in the case of one service on a cluster node, read on.


(b) 1 NFS service on the cluster, so no worries about stomping on another NFS service or its clients.  Stopping the service doesn't affect other NFS services (because there are none), so it's always clean to stop nfsd.  This can be solved by adding /etc/init.d/nfs as a child of the first file system export, a la:

  <service name="myNFS">
    <fs name="my_filesystem" device=... >
      <script name="nfs_script" file="/etc/init.d/nfs"/>
      <nfsexport name="export_list_1" >
        <nfsclient .../>
        <nfsclient .../>
        <nfsclient .../>
        <nfsclient .../>
      </nfsexport>
    </fs>
    <fs ... >
      <nfsexport name="export_list_1" >
        <nfsclient .../>
        ...
      </nfsexport>
    </fs>
    <ip address="10.1.1.2" />
  </service>

For (b) I had also preemptively written a complete <nfsserver> resource which also manages NSM lists and what-not.  However, this resource is not part of RHEL5.

Comment 30 RHEL Program Management 2009-02-16 15:43:45 UTC
Updating PM score.

Comment 31 Lon Hohberger 2009-03-31 21:04:33 UTC
Because the upstream patch was largely rejected:

Option (a) -> wontfix; destabilizing

Option (b) -> wontfix; destabilizing.  Workaround: a user could use /etc/exports and /etc/init.d/nfs as a <script> resource instead of the cluster's NFS resource agents.

Another workaround: Use UDP.

I agree with Jeff -- this should probably be noted in the documentation somewhere, or at least in a Knowledge Base article.