Description of problem:

Lon and I were noticing that NFS client I/O (when mounted with tcp) can deadlock for 1 to 15 minutes after a service relocation has taken place. When mounted with udp, the I/O pause is never more than 10 seconds or so.

include/linux/nfs_fs.h:
    #define NFS_MAX_UDP_TIMEOUT     (60*HZ)
    #define NFS_MAX_TCP_TIMEOUT     (600*HZ)

This is what the config looks like:

[root@hayes-01 ~]# clustat
Member Status: Quorate

  Member Name                  ID   Status
  ------ ----                  ---- ------
  hayes-01                        1 Online, Local, rgmanager
  hayes-02                        2 Online, rgmanager
  hayes-03                        3 Online, rgmanager

  Service Name           Owner (Last)           State
  ------- ----           ----- ------           -----
  service:EXT1           hayes-02               started
  service:EXT2           hayes-01               started

<rm>
  <failoverdomains>
    <failoverdomain name="ALL" ordered="0" restricted="0">
      <failoverdomainnode name="hayes-02" priority="1"/>
      <failoverdomainnode name="hayes-03" priority="1"/>
      <failoverdomainnode name="hayes-01" priority="1"/>
    </failoverdomain>
  </failoverdomains>
  <resources>
    <ip address="10.15.89.209" monitor_link="1"/>
    <ip address="10.15.89.210" monitor_link="1"/>
    <fs device="/dev/hayes/aoe1" force_unmount="1" fstype="ext3" mountpoint="/mnt/ext1" name="ext1" options=""/>
    <fs device="/dev/hayes/aoe2" force_unmount="1" fstype="ext3" mountpoint="/mnt/ext2" name="ext2" options=""/>
    <nfsexport name="nfs exports"/>
    <nfsclient name="all" options="rw" target="*"/>
  </resources>
  <service autostart="1" domain="ALL" name="EXT1">
    <fs ref="ext1">
      <nfsexport ref="nfs exports">
        <nfsclient ref="all"/>
      </nfsexport>
    </fs>
    <ip ref="10.15.89.209"/>
  </service>
  <service autostart="1" domain="ALL" name="EXT2">
    <fs ref="ext2">
      <nfsexport ref="nfs exports">
        <nfsclient ref="all"/>
      </nfsexport>
    </fs>
    <ip ref="10.15.89.210"/>
  </service>
</rm>

Version-Release number of selected component (if applicable):
2.6.18-53.el5
rgmanager-2.0.31-1.el5
This is a kernel bug. The TCP NFS client in RHEL5 is supposed to retry. Maybe it didn't flush the ARP cache or something and was trying to send to the old hardware address?
This doesn't look specific to rgmanager, and it's on the *client* side anyway. Basically, sometimes recovery (i.e. I/O resuming on the client) happens quickly (a few seconds), while at other times it takes 10-15 minutes to resume. Note that this problem seems to be specific to NFS over TCP; NFS over UDP routinely recovers quickly.
In linux/include/linux/nfs_fs.h:

    #define NFS_MAX_TCP_TIMEOUT (600*HZ)

I wonder if this is what's causing our problem...
In recent 5.1.Z testing I've seen hangs that last for days with tcp mounts (which are the default).
Rescuing this from the Fedora kernel queue... Yes, we do crank up the RPC timeouts on TCP since it's desirable to allow the TCP layer to handle retransmissions rather than having the RPC layer do it. This really sounds like something around the TCP or IP layer -- it might be a problem with ARP timeouts, but if Nate's comment #4 is correct, then maybe there's something else going on. Some network captures from the client when this is happening would be interesting -- in particular, to see whether packets are leaving the client box, and where they're being sent.
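Something along these lines ought to do for the capture (the interface name and capture path are just examples; 10.15.89.209 is one of the service addresses from the config above):

  # on the client, while reproducing the hang:
  tcpdump -i eth0 -s 0 -w /tmp/nfs-failover.pcap host 10.15.89.209 and port 2049

  # in a second window, watch ARP to confirm the gratuitous ARPs arrive and the
  # client switches to the new MAC address:
  tcpdump -i eth0 -e -n arp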
The fundamental problem is that a lot of people misunderstand what it means for TCP to be reliable: it is a reliable data protocol, but not a reliable session protocol. That is, if a data packet is lost, the TCP stacks will retransmit; but if the session is lost, the TCP stacks will not attempt to reconnect. The NFS client for TCP is written assuming that TCP is fully reliable, and will not attempt to reconnect if the session dies, but that for UDP assumes that UDP is fully unreliable, and will retry.
All true, but hanging for tens of minutes is decidedly non-optimal. I have to wonder if there might be something we can do to help the client speed along reconnecting in this situation. The first thing we need to do is verify what the client is actually doing during these 10-15 minutes. The TCP layer will be resending frames, but does the RPC layer ever get an error back? Also, when we fail over, I'm pretty sure the cluster sends gratuitous ARPs -- so why is the client not picking these up and sending the retransmissions to the other server? The other server obviously won't know anything about the connection, so it should send back a RST. Why isn't the client reconnecting when this occurs? All in all, I think we need to sit down and look at what's actually happening on the wire and in the client to see why this isn't failing over as quickly as we'd like. When I get some time, I'll plan to ping Lon and see if I can get access to his cluster so we can do so.
A 10 minute timeout seems a little excessive. If the server hasn't responded in a minute, then something is going on and the users need to be informed.
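For what it's worth, the effective RPC timeout can also be tuned per mount rather than waiting on a kernel change. A rough sketch, with illustrative values (timeo is in tenths of a second); this doesn't fix anything, it just makes the client report "server not responding" and retry sooner:

  mount -t nfs -o proto=tcp,timeo=100,retrans=3 10.15.89.209:/mnt/ext1 /mnt/ext1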
Chatted with Corey over IRC and he mentioned that he could probably get me access to a cluster where we can test this. I'll set this to NEEDINFO for him for now...

Here's what I think should be happening in this situation:

1) client is in the middle of some I/O, NFS service fails over to other cluster member
2) server sends gratuitous ARP
3) client (or a router if they're on different subnets) picks this up and starts sending packets to "new" server
4) "new" server doesn't know anything about this connection and sends back an RST
5) client reconnects the socket, and starts resending data

...obviously something is falling down and this isn't happening. If everything on the wire is happening like we think it should and the client is getting back the RST, then we probably need to ensure that that returns some sort of error back to the RPC layer so the connection can be re-established and the RPC call reissued.

I think we'll want to recreate this problem while sniffing traffic on the client (and maybe on the servers), and also while capturing nfs and rpc debug info. We should also try to come up with a reproducer that involves minimal actual I/O.
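For the debug info, something like this on the client while reproducing should do (32767 is just "everything" in the debug bitmask; output goes to the kernel log):

  echo 32767 > /proc/sys/sunrpc/rpc_debug
  echo 32767 > /proc/sys/sunrpc/nfs_debug
  # ... reproduce the failover and hang ...
  echo 0 > /proc/sys/sunrpc/rpc_debug
  echo 0 > /proc/sys/sunrpc/nfs_debug
  dmesg > /tmp/rpc-nfs-debug.txt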
Looking over this today -- this apparently "normally" works. The hang doesn't happen every time. It looks like there may be more than one problem here -- we may need more BZ's:

1) the xiogen/xdoio test seems to occasionally detect data corruption during failover. We need to let it run for a bit w/o failing over and see if we can detect the same sort of corruption w/o a failover. This is probably unrelated to the original problem, but may be a regression from 5.1-ish kernels.

2) sometimes when processes are exiting, they don't seem to be cleaning up their locks on the server. I still see them in /proc/locks on the server. I don't think this is what is supposed to happen...

3) I think I've reproduced a hang during failover. What it looks like is that the client and server are stuck in some sort of ACK positive feedback loop:

1212077402.302324 10.15.89.40 -> 10.15.89.209 TCP [TCP Dup ACK 2#20] 891 > nfs [ACK] Seq=1681315 Ack=373611122 Win=48971 Len=0 TSV=2920515 TSER=773453651
1212077402.302712 10.15.89.209 -> 10.15.89.40 TCP nfs > 891 [ACK] Seq=0 Ack=0 Win=501 Len=0 TSV=774020121 TSER=2262704
1212077402.303012 10.15.89.40 -> 10.15.89.209 TCP [TCP Dup ACK 2#21] 891 > nfs [ACK] Seq=1681315 Ack=373611122 Win=48971 Len=0 TSV=2920515 TSER=773453651
1212077402.303314 10.15.89.209 -> 10.15.89.40 TCP nfs > 891 [ACK] Seq=0 Ack=0 Win=501 Len=0 TSV=774020121 TSER=2262704

...the client and server seem to be sending ACKs back and forth as fast as possible. I cranked up rpc_debug after this started. Just this:

-pid- proc flgs status -client- -prog- --rqstp- -timeout -rpcwait -action- ---ops--
03888 0018 04a0 000000 df525600 100003 d1592000 00060000 xprt_pending e036c580 e0378608

...just 1 RPC call in queue, and no other debug output, so I think it was in the bowels of the TCP code at that time.
Eventually this resolved itself when it hit a major TCP timeout...debug output from that (which looks normal):

RPC: 3888 running timer
RPC: 3888 xprt_timer
RPC: 3888 __rpc_wake_up_task (now 3377991)
RPC: 3888 disabling timer
RPC: 3888 removed from queue d15570a8 "xprt_pending"
RPC: __rpc_wake_up_task done
RPC: 3888 sync task resuming
RPC: 3888 call_status (status -110)
RPC: 3888 call_timeout (major)
RPC: 3888 call_bind (status 0)
RPC: 3888 call_connect xprt d1557000 is connected
RPC: 3888 call_transmit (status 0)
RPC: 3888 xprt_prepare_transmit
RPC: 3888 call_encode (status 0)
RPC: 3888 marshaling UNIX cred d3492c40
RPC: 3888 using AUTH_UNIX cred d3492c40 to wrap rpc data
RPC: 3888 xprt_transmit(112)
RPC: xs_tcp_send_request(112) = 112

...during this time rpciod on the client appeared to be idle:

=======================
rpciod/0      S 0000026F  2684  2394     7          1208 (L-TLB)
       d157cf88 00000046 5f2c3e9f 0000026f 00000003 d7529160 c14a1c00 0000000a
       df2c5000 daf47aa0 5f2c57f8 0000026f 00001959 00000000 df2c510c c13f4ee0
       d34481f0 d34481f4 00000000 00000082 c04331fe d157cfa4 d7529154 00000246
Call Trace:
 [<c04331fe>] run_workqueue+0x98/0xb5
 [<c0433a6b>] worker_thread+0xb2/0x10b
 [<c042027b>] default_wake_function+0x0/0xc
 [<c04339b9>] worker_thread+0x0/0x10b
 [<c0435ea1>] kthread+0xc0/0xeb
 [<c0435de1>] kthread+0x0/0xeb
 [<c0405c3b>] kernel_thread_helper+0x7/0x10

...and the df command I was running and had hung looked like this:

df            S 00000340  2176  2609  2302                 (NOTLB)
       cff76cf0 00000082 82bacc2c 00000340 d34480e8 00000246 c077d783 00000007
       dabf3aa0 da85a550 848a2c99 00000340 01cf606d 00000000 dabf3bac c13f4ee0
       cff76ce0 d1557050 00000000 d15571c4 00000026 cff76d24 c13f0724 00000001
Call Trace:
 [<e036fa40>] rpc_wait_bit_interruptible+0x1d/0x22 [sunrpc]
 [<c0608643>] __wait_on_bit+0x33/0x58
 [<e036fa23>] rpc_wait_bit_interruptible+0x0/0x22 [sunrpc]
 [<e036fa23>] rpc_wait_bit_interruptible+0x0/0x22 [sunrpc]
 [<c06086ca>] out_of_line_wait_on_bit+0x62/0x6a
 [<c0435f90>] wake_bit_function+0x0/0x3c
 [<e036ff3d>] __rpc_execute+0xde/0x1f8 [sunrpc]
 [<e036b94b>] rpc_call_sync+0x6b/0x91 [sunrpc]
 [<e03f37b9>] nfs3_rpc_wrapper+0x1f/0x5b [nfs]
 [<e03f4da9>] nfs3_proc_statfs+0x54/0x75 [nfs]
 ...

...the nfsd's on the server seemed to be idle. I think this might be a problem in the lower-level networking code. We shouldn't be getting ping-pong ACK storms like this, I don't think. I'll have to see if I can come up with a way to semi-reliably reproduce this.
Problem #2 above might be related to this bug: https://bugzilla.redhat.com/show_bug.cgi?id=448929 We'll have to test Peter's patch and see... For now, I'm going to use this case to pursue the hangs.

I was able to reproduce an ACK storm during failover. I'm not sure of the exact cause, but I think it might be related to the ACK storm description in this Microsoft (!) TechNet article: http://technet.microsoft.com/en-us/magazine/cc160809.aspx My theory is that somehow when we time a failover just right, we end up in an ACK storm like this.
Problem 1 may be related to this bug of Peter's. At the very least, testing a kernel with his patch may be worthwhile to see if it prevents the problem. https://bugzilla.redhat.com/show_bug.cgi?id=435291
Created attachment 307111 [details]
first part of binary capture of ACK storm starting

This is a binary capture I caught while reproducing this. One thing I've noticed is that when the service fails over, it does not close the sockets between it and any clients. So when you fail back, the server thinks there's still an active socket and will accept things like ACKs on it. That may be related to this problem, though I can't confirm whether having an active socket is actually necessary for this.

In the capture, we see the server get a FSINFO RPC call, and then the client and server start hammering each other with the same resync ACK packets.

There does seem to be a workaround for this problem: when I fail over again, it seems to become unstuck. Of course, that's not ideal...
Jeff and I were talking about this, and it seems, based on the above trace, that this is a variant of a TCP sequence prediction attack. Jeff indicated to me that he recreated the error by repeatedly failing over the service between the two servers. Frames 10 and 11 in the trace above show the gratuitous ARPs which repoint the client to the 'new' server, and frames 12 onward show that this is successfully received by the client (it uses the new MAC address). That's when the ACK storm starts.

Our theory is that neither server in the cluster is closing open sockets when the service is migrated. The result is that, when the service is failed back and forth repeatedly, the client reuses the same connection again and again on each server. We are theorizing that, since the client is reusing these connections, some amount of data is ACKed by the 'old' server right before we fail over to the 'new' server. As a result, when the client picks up where it left off with the new server from the last failover, the server ACKs the last sequence number it saw, while the client is a large number of bytes ahead of that. The client has discarded that data, as it was acked by the 'old' server, and it cannot be recovered. We're getting a larger dump to confirm this now.
Created attachment 307118 [details] failover 2 More complete (and smaller) binary capture of the problem.
Ok, I can confirm that we're seeing what is effectively a sequence prediction problem here, which arises from the fact that in a failover in which the 'failed' server stays up, it doesn't close TCP sockets that were opened by the migrating service. The result is that, on multiple service failovers, an NFS client using TCP will 'reuse' TCP connections on the server, and it becomes possible to lose data in the TCP stream that is acked by a failing server right before service migration. The packet capture in comment #15 illustrates this:

General notes:
10.15.89.40 is the NFS client
10.15.89.209 is the server
00:1d:09:03:c8:f0 is the MAC address for real server 1 on the cluster (smac1)
00:1d:09:01:cc:2d is the MAC address for real server 2 on the cluster (smac2)

Frames:
Frame 9:  Client establishes initial connection to server at smac1
Frame 21: Server at smac1 ACKs client sequence up to seq #0xedbb53fa. This is the last frame sent on this connection before failover to the server at smac2.
Frame 29: Gratuitous ARP sent from cluster to repoint clients to the server at smac2
Frame 32: In response to a subsequent TCP frame from the client, the server sends a RST frame. This is good.
Frame 37: NFS layer, detecting the reset connection, re-establishes the connection to the server at smac2
Frame 43: Server at smac2 sends an ACK frame acking client sequence up to seq #0xedbb6bdd
Frame 50: Gratuitous ARP sent from cluster to repoint clients to the server at smac1
Frame 60: Client starts re-using the connection with the server at smac1 (established in frame 9). Sends a TCP frame to the server at smac1 with sequence number 0xedbb6bdd.
Frame 61: Server at smac1 responds with an ack up to seq number 0xedbb53fa (which was the last sequence number it received and acked in frame 21).

The result here is that the frames sent from the client between frames 21 and 61 were already acknowledged by the server at smac2, and so are unavailable at the client. The client sees that the 'server' has effectively reneged and is saying that it is missing bytes in the stream that it has already acked, while the server at smac1 sees only that it has a gap in its stream that can't be filled. The result is an infinite retry between the client and the server. Backwards sequence numbers are silently ignored by the client (which IIRC is per the RFC), while all the server at smac1 can do is ACK its last known sequence number, hoping to fill in the missing data.
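For anyone wanting to double-check the numbers from the capture, a tshark invocation along these lines should dump the frame numbers, source MACs, and seq/ack values (the preference and field names assume a reasonably current tshark, and the capture filename is just whatever the attachment gets saved as; relative sequence numbers are disabled so the output corresponds to the absolute 0xedbb... values discussed above):

  tshark -r failover2.pcap -o tcp.relative_sequence_numbers:FALSE \
         -T fields -e frame.number -e eth.src -e ip.src -e tcp.seq -e tcp.ack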
Thanks Neil... Lon seemed to think that he was able to reproduce hangs on a "first time" failover. If so, then that's probably a different problem. I suggest that we concentrate on fixing this for now and then see whether this might be reproducible afterward... As far as fixes go, some changes went upstream a while back to add an interface whereby lockd would drop all locks associated with a particular cluster IP address. We could possibly hook into that interface and have it also close any nfsd connections for the same IP, or maybe add a new /proc file that does something similar. We'll have to look over those patches and see if they're suitable for RHEL though. Another possibility might be to have nfsd close all of its non-listening sockets when it gets a particular signal. That would mean that connections to nfsd that are not related to the service moving would be closed, but in theory the server would just send back a RST on the next packet and the connection would be re-established. Hiccups should be fairly minimal, and would only happen on failover...
No problem. If Lon thinks that he can reproduce this on an initial failover (or more specifically, a failover in which the new server hosting the service doesn't already have a TCP connection established with the client) then we'll need to see evidence of that, since that would definitely be a different and unique problem. The trace above (reproducing with multiple successive failovers) is rather conclusive in its root cause.

As far as the fix goes, your idea sounds reasonable to me. About the only thing I would worry about is scope. Your proposed fix sounds like it will work rather well for nfs, but the problem as described above is applicable to any clustered service that uses TCP. Any case in which a service migration between live servers doesn't close sockets on the 'departing' server has exposure to this bug. I'm not sure what the best way to handle this is (a case-by-case basis may be best).
NFS is a bit of a special case in RHCS. Most userspace services do get shut down on failover. We don't for NFS since nfsd is somewhat of a "shared resource" and it's possible (though not common) for someone to have NFS shares that are not clustered mixed with those that are. The failover case here is also a bit uncommon -- most actual users of RHCS wouldn't be failing back and forth this quickly. We *do* need to fix this though since QA does do this when testing and they're an important "customer". You're quite correct though that this could be a problem for other TCP-based services if sockets are not closed on failover. We might also consider writing something about this for the CS docs so that someone setting up a new clustered service doesn't get bitten by the same thing.
re: Comment #17 - I could be wrong about that, and for now, we'll go with the assumption that I _am_ wrong. If for some reason I turn out to be correct, we'll address that later.

re: comment #19 - It's not so much mixed clustered+local NFS that I'm worried about, it's more "two NFS services" which are sharing a node for some reason.

It also definitely sounds like a case which could potentially be hit on other services, but I can't come up with any cases. Generally, services require separate instances of daemons specific to that service, and we don't recommend/test running a single httpd listening on INADDR_ANY. Rather, we generally tell people to have per-service instances of daemons when possible - listening on specific IP addresses - likely for this very reason.

Is there a way to make a kernel nfsd thread that's bound to a specific set of IP addresses?
Whoops, refactor P3 so it's readable: We don't recommend/test (for example) running a single httpd listening on INADDR_ANY. Rather, we tell people to have per-service instances of various daemons listening on service-assigned IP addresses. This generally avoids the problem we encountered with nfsd.
> Is there a way to make a kernel nfsd thread that's bound to a specific set of IP > addresses? Unfortunately, no...
Posted a question about this upstream with a rough outline of a proposal to fix it. It generated quite a lot of interest! Some high points:

1) we should try to make a single interface that we call into for failover. The "unlock_ip" stuff went upstream recently that can be used to just unlock locks that have come in via a particular IP address. That should be extended (and maybe renamed) to close sockets as well. We may need it to do other things.

2) simply closing the socket probably won't be sufficient. A possible scenario: We get a non-idempotent request (say, a rename or unlink). The server gets the call, and does the change on the local filesystem. But, before we can send the reply back, the sockets are closed, and the local address is torn down so no reply is sent. The other server takes over the filesystem and IP address, etc. The client retransmits the request after reestablishing the connection, but now the request fails. The file has already been renamed or unlinked, etc...

Truthfully, we may already be vulnerable to this problem. It's not clear to me that we can ever eliminate this race 100%. It's possible for the server to crash or lose network connectivity after getting and processing a request but before sending the reply.
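For reference, the upstream interface mentioned in (1) is driven from userspace with a simple write, roughly like this (assuming the /proc/fs/nfsd/unlock_ip file from those patches is present; the address is the EXT1 service address from this cluster config):

  # drop all of lockd's locks that arrived via the service address being migrated;
  # the proposal is to extend this (or something like it) to also close the nfsd
  # sockets bound to that address
  echo 10.15.89.209 > /proc/fs/nfsd/unlock_ip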
I think the simplest thing to do here is to just stop nfsd prior to taking down the IP address, and then to start it up again after the address is gone. That should make sure that all requests are flushed out. Clients should transparently recover if they happen to hit the server after nfsd has gone down, but before the address has been removed. It's not an ideal solution, but it's entirely doable from userspace. I don't see much in the way of advantages to trying to do something special in kernel for this. Perhaps if we want to take advantage of unlock_ip then we'll want to think about this differently.

So to roughly hash this out, on failover:

On "old" server:
    pkill -2 nfsd
    remove ip address from interface
    start up nfsd again
    pkill -9 lockd

...the "new" server should behave more or less as it does today.
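A rough sketch of that sequence as shell on the "old" server (the interface name and netmask are assumptions, and /etc/init.d/nfs is the usual RHEL5 init script; this is only meant to illustrate the ordering above):

  pkill -2 nfsd                           # stop the nfsd threads so in-flight requests drain
  ip addr del 10.15.89.209/24 dev eth0    # drop the floating service address
  /etc/init.d/nfs start                   # bring nfsd back up for anything else it serves
  pkill -9 lockd                          # drop lock state for the departed service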
Development Management has reviewed and declined this request. You may appeal this decision by reopening this request.
Reassigning this back to Lon. I think we understand at least this problem now, but I don't see upstream getting any sort of interface to close open sockets here. I think the best thing to do is something like I explained in comment #24. Alternately we could consider a blurb in the docs about this problem and just tell people not to fail over and back rapidly to avoid it...
There are two cases:

(a) 2+ NFS services are running on a cluster node. Stopping one (incl. killing nfsd) will affect the other - clients of the *other* service will suddenly get ESTALE.

To prevent ESTALE on clients of the *other* NFS service, we will need to:
 * stop all NFS services
 * kill lockd / nfsd / etc.
 * start lockd / nfsd / etc.
 * start all but the NFS service we wanted to stop

This solution is suboptimal at best (clients lose access temporarily), and destabilizing at worst (this will require a pretty decent design to fix). I might have a way to fix it, but it's not well tested and not in RHEL5.

One roundabout way to solve this is to present LUNs to virtual machines and store the IP address, NFS export list, and mount points all within the virtual machine, and export the file system(s) from that virtual machine instead of from a cluster service. Stopping/starting virtual machines should not suffer any of these interaction problems.

Now, in the case I'm wrong about ESTALE (clients will just retry), or in the case of one service on a cluster node, read on.

(b) 1 NFS service on the cluster, so no worries about stomping on another NFS service or its clients. Stopping the service doesn't affect other NFS services (because there are none), so it's always clean to stop nfsd. This can be solved by adding /etc/init.d/nfs as a child of the first file system export, a la:

<service name="myNFS">
  <fs name="my_filesystem" device=... >
    <script name="nfs_script" file="/etc/init.d/nfs"/>
    <nfsexport name="export_list_1" >
      <nfsclient .../>
      <nfsclient .../>
      <nfsclient .../>
      <nfsclient .../>
    </nfsexport>
  </fs>
  <fs ... >
    <nfsexport name="export_list_1" >
      <nfsclient .../>
      ...
    </nfsexport>
  </fs>
  <ip address="10.1.1.2" />
</service>

For (b) I had also preemptively written a complete <nfsserver> resource which also manages NSM lists and what-not. However, this resource is not part of RHEL5.
Updating PM score.
Because the upstream patch was largely rejected:

Option (a) -> wontfix; destabilizing
Option (b) -> wontfix; destabilizing

Workaround: as an alternative, a user could use /etc/exports and /etc/init.d/nfs as a <script> instead of the cluster's NFS resource agents.

Another workaround: use UDP.

I agree with Jeff -- this should probably be noted in the documentation somewhere, or at least in a Knowledge Base article.
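A rough sketch of those workarounds, for the record (the export path mirrors the existing ext1 service, the export options and client wildcard mirror the "all clients, rw" nfsclient resource, and the exact values are up to the admin):

  # Workaround 1: manage the export outside the cluster nfsexport/nfsclient agents.
  # /etc/exports entry on each node that can host the service:
  #     /mnt/ext1    *(rw,sync)
  # ...with /etc/init.d/nfs added to the service as a <script> resource so it is
  # started/stopped along with the service, then:
  exportfs -ra

  # Workaround 2: have the clients mount over UDP instead of TCP:
  mount -t nfs -o proto=udp 10.15.89.209:/mnt/ext1 /mnt/ext1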