Bug 132823 (RHEL4NFSFailover)
Summary: RHEL4 U1: NFS cluster failover (hw)

| Field | Value |
|---|---|
| Product | Red Hat Enterprise Linux 6 |
| Component | kernel |
| Version | 6.0 |
| Hardware | All |
| OS | Linux |
| Status | CLOSED WONTFIX |
| Severity | medium |
| Priority | medium |
| Target Milestone | rc |
| Whiteboard | Kernel |
| Keywords | FutureFeature |
| Doc Type | Enhancement |
| Reporter | Tim Burke <tburke> |
| Assignee | Peter Staubach <staubach> |
| QA Contact | Corey Marthaler <cmarthal> |
| CC | alexander.samad, bstevens, henry.harris, hgarcia, jbacik, jbaron, jlayton, jplans, jscalia, kanderson, k.georgiou, lhh, mstadtle, notting, nstrug, sghosh, staubach, steved |
| Last Closed | 2009-09-30 17:42:44 UTC |
| Bug Depends On | 166458, 166701, 167571, 167572, 175215, 175229 |
| Bug Blocks | 139847, 180185, 430698 |
Description
Tim Burke
2004-09-17 14:38:16 UTC
Created attachment 107780 [details]
NFS failover unit test script (requires distributed ssh keys)
Don't know if it helps, but here's the unit test I used for:
(1) Basic NFS mount/umount test of cluster NFS export
(2) Normal I/O to clustered NFS export
(3) Normal I/O during restart of cluster NFS export
(4) Normal I/O during relocate of cluster NFS export
(5) Normal I/O during failover of cluster NFS export (by rebooting the active
node)
You have to customize all the variables at the top of the script and distribute
ssh keys (from the client to the servers). I can help carve out a cluster
if/when necessary.
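(Not the attached script, but a rough sketch of what one pass of the relocate-under-I/O case can look like from the client side. The floating IP, export path, mount point, service name, and node name below are hypothetical placeholders corresponding to the variables at the top of the real script; clusvcadm is the usual RHCS relocation tool, assuming rgmanager manages the NFS service.)

    #!/bin/bash
    # Hypothetical sketch: one pass of "normal I/O during relocate of cluster NFS export".
    SERVER_VIP=192.168.0.10      # floating IP of the clustered NFS service (placeholder)
    EXPORT=/export/test          # exported directory (placeholder)
    MNT=/mnt/nfstest             # client-side mount point (placeholder)
    SVC=nfssvc                   # rgmanager service name (placeholder)
    ACTIVE_NODE=nodeA            # node currently running the service (placeholder)

    set -e
    mkdir -p "$MNT"
    mount -t nfs -o tcp,vers=3 "$SERVER_VIP:$EXPORT" "$MNT"

    # Start background I/O against the export.
    dd if=/dev/zero of="$MNT/iotest.dat" bs=1M count=1024 &
    IOPID=$!

    # Relocate the NFS service to another node while the I/O is in flight.
    ssh "$ACTIVE_NODE" clusvcadm -r "$SVC"

    # The I/O should finish without EIO/ESTALE/EBUSY/EPERM.
    wait "$IOPID" || echo "I/O failed during relocation"

    umount "$MNT"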
Adding all the currently open bugzillas related to NFS behavior during/after relocation/failover on RHEL4.

*** Bug 166543 has been marked as a duplicate of this bug. ***

Adding alias.

Based on things I tested earlier today, RHEL4 may not need rmtab maintenance at all, in GFS or non-GFS exports. We're looking into it more.

* RHEL4 does not need rmtab maintenance like RHEL3.
* rgmanager should have inheritable fsid tags to make configuration easier (rather than having to do fsid=x for every NFS client ... :( )
* Added the relevant fsid inheritance bugzilla to this one for tracking.

test-script-blue performs 5 basic sanity tests with both TCP & UDP for nfsv3 exports in Red Hat Cluster Suite:

* mount, umount
* mount, perform some I/O, unmount
* mount, restart service on the same node while doing I/O, unmount
* mount, relocate service to a different node while doing I/O, unmount
* mount, obliterate the server running the service (reboot -fn), causing a hard failover, unmount

While the I/O is running, it should not return an I/O error, ESTALE, EBUSY, EPERM, etc.

Here's a test matrix from running it on a RHCS4 cluster with GFS- and ext3-based NFS exports. Either it passes all 5 tests as noted above, or it fails. Note that the sleep-hack (a hack which lets nfsd hopefully clear its queue of pending requests by sleeping for 10 seconds) is present.

                  RHEL3   RHEL4
    ext3 export   Pass    Pass
    gfs export    Pass    Pass

In all cases, the client took a long time (5-15 minutes) to recover. AFAIK, this is not a cluster problem; the retry time increases with each successive failure with NFS.

When using RHEL4 as a client, the CPU got pegged with ksoftirqd after about 4-5 minutes - it seems like when NFS is in its retry loop it enters some infinite-loop condition after that time. What is weird is that the NFS service has failed over and is mountable by other clients long before the waiting client enters this weird state.

    PID USER  PR  NI VIRT RES SHR S %CPU %MEM    TIME+ COMMAND
      3 root  39  19    0   0   0 S 99.9  0.0 12:37.34 ksoftirqd/0

After about 10-11 minutes of being CPU-pegged, the NFS client recovers and life goes on. This behavior does not happen with a RHEL3 client. This could be tunable; I don't know, but a pegged CPU seems bad...

Other thing to note: the *failover* test is only done using TCP, there's a 15-20 second failover time, and this test is the only one which takes a really long time (>5 minutes) to complete. I could switch to udp, but I think we care most about tcp. Steve can make that call, in any case.

Could you please post a sysrq-t system backtrace and a sysrq-p backtrace of the CPUs when the CPU gets pegged... With the >5 min delay a sysrq-t would be good, and an ethereal dump to see if there is any NFS traffic going over the wire would also be good...

the section in question looks like this:

    Section "Device"
        Identifier  "Videocard0"
        Driver      "i810"
        VendorName  "Videocard vendor"
        BoardName   "Intel 810"
    EndSection

ignore last comment, wrong bz id

Steve -- I'll get you the stack trace today.

Created attachment 122341 [details]
messages / sysrq-t output
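(For anyone trying to reproduce this, a rough sketch of how the traces requested above can be gathered on the client; the interface name, server address, and output path are placeholders, and tcpdump is used here in place of ethereal.)

    # Enable sysrq and dump task/CPU state into the kernel log (/var/log/messages).
    echo 1 > /proc/sys/kernel/sysrq
    echo t > /proc/sysrq-trigger     # sysrq-t: backtraces of all tasks
    echo p > /proc/sysrq-trigger     # sysrq-p: current CPU registers

    # Capture the NFS traffic to/from the failover IP while the client is stuck.
    tcpdump -i eth0 -s 0 -w /tmp/nfs-failover.pcap host 192.168.79.11 and port 2049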
Ok, I have a tcpdump, but it's like 400MB, so I'm not going to attach it. I started the tcpdump as soon as my service-relocate completed, and hit ctrl-C on my application so that it would stop as soon as it came out of disk-wait:

Fri Dec 16 13:45:32 EST 2005

    13:45:32.336938 IP red.lab.boston.redhat.com.799 > 192.168.79.11.nfs: . ack 2110919727 win 1728 <nop,nop,timestamp 848259206 78019328>
    13:45:32.338870 IP 192.168.79.11.nfs > red.lab.boston.redhat.com.799: . ack 4166286499 win 16022 <nop,nop,timestamp 270554560 847653134>
    13:45:32.338899 IP red.lab.boston.redhat.com.799 > 192.168.79.11.nfs: . ack 1 win 1728 <nop,nop,timestamp 848259206 78019328>
    13:45:32.338919 IP 192.168.79.11.nfs > red.lab.boston.redhat.com.799: . ack 4166286499 win 16022 <nop,nop,timestamp 270554560 847653134>
    13:45:32.338938 IP red.lab.boston.redhat.com.799 > 192.168.79.11.nfs: . ack 1 win 1728 <nop,nop,timestamp 848259206 78019328>
    ...

It recovered at this point:

    14:02:32.529162 IP red.lab.boston.redhat.com.8092349 > 192.168.79.11.nfs: 1448 write [|nfs]

Could these be some of the dup-ack problems?

Hi, I was wondering if there has been any resolution to this? Thanks.

To provide minimum NFS failover functionality with NFS V2/V3, the following is the tentative (base kernel) work item list:

A-1: NFSD request reply cache gets copied from the taken-over server into the new server upon failover. Cache size = 1024 * struct svc_cacherep (64K maximum). (Both upstream and RHEL kernels.)

A-2: Allow the RPC layer to close a (TCP) socket connection without going into TIME_WAIT state. (RHEL kernel only.)

A-3: Allow umounting a filesystem (via a kernel force-umount call) regardless of the open file reference count. This is to insulate failover from the ever-possible kernel and/or filesystem bugs that somehow leave a file reference count dangling around. This feature (todo item) is mentioned in the Linux 2006 kernel summit notes (http://lwn.net/Articles/191926/). (Both upstream and RHEL kernels.)

RHEL 4.4 NFS failover restrictions
----------------------------------

B-1: Unless NFS client applications can tolerate ESTALE and/or EPERM errors, I/O activities on the failover IP interface must be temporarily quiesced until the active-active failover transition completes. This is to avoid non-idempotent NFS operation failure on the new server. (Check out "Why NFS Sucks" by Olaf Kirch, placed as "kirch-reprint.pdf" in the 2006 OLS proceedings from https://ols2006.108.redhat.com/.)

B-2: With various possible base kernel bugs outside RHCS' control, there are possibilities that a local filesystem (such as ext3) umount could fail. To ensure data integrity, RHCS will abort the failover. The admin could either specify the self-fence (reboot the taken-over server) option to force failover (via the cluster.conf file) or re-mount the filesystem on the taken-over server as ro (read-only) to allow failover. Both options have the possibility of losing data. (Side note: not sure whether we could do the re-mount as read-only in user space - need to check.) This restriction doesn't apply to the GFS cluster filesystem.

B-3: If an NFS client invokes an NLM locking call, the subject NFS servers (both the taken-over and take-over servers) will enter a global 90-second (tunable) locking grace period for every NFS service on the servers.

B-4: If NFS-TCP is involved, failover should not be issued on the same pair of machines multiple times within a 30-minute period; for example, failing over from node A to B, then immediately failing from B back to A, would hang the connection. This is to avoid the TCP TIME_WAIT issue.
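(A hedged illustration of the manual knobs touched on in B-2 and B-3; the export path and values are examples only, not a supported procedure.)

    # B-2: if the umount of an ext3 export fails during failover, re-mounting it
    # read-only on the taken-over server is one way to let the failover proceed,
    # at the risk of losing buffered data.
    umount /export/test || mount -o remount,ro /export/test

    # B-3: the 90-second NLM grace period is tunable where the kernel exposes the
    # lockd sysctl (path may vary by kernel version).
    cat /proc/sys/fs/nfs/nlm_grace_period           # current value, if present
    echo 30 > /proc/sys/fs/nfs/nlm_grace_period     # e.g. shorten to 30 seconds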
RHEL 4.5 and/or RHEL 5.1 Improvements
-------------------------------------

C-1: Implement A-1, subject to upstream acceptance.

C-2: Improve the forced ro-remount (B-2) so it is less likely to lose data. Also subject to upstream acceptance.

C-3: B-3 issue - subject to the acceptance of the patches submitted in: https://www.redhat.com/archives/cluster-devel/2006-August/msg00000.html.

C-4: A-3 issue - backport the upstream changes into RHEL with restrictions (may not work well if the NFS client and servers are across a firewall).

Long Term Items
---------------

D-1: Implement A-3 to lift most of the failover issues.

D-2: Note that the NFS RPC reply cache is not fool-proof. We may need to study filesystem-specific (particularly GFS) features (such as piggy-backing state information into the NFS filehandle) to allow error-free failover.

*** Bug 178057 has been marked as a duplicate of this bug. ***

Status Summary: With the current state of Linux kernels (both RHEL and upstream), NFS failovers have been error-prone. An example is the "non-idempotent" issues associated with operations such as "rm" or "rename". This could start with an (NFS) client sending in a file remove request ("rm"). The (NFS) server passes the call into the filesystem and somehow gets stuck there for a while (say, waiting for a directory lock). A timeout occurs on the client side and a re-transmit happens. Current Linux NFSD is designed to handle this issue with a global "request reply cache", where each request is checked against entries in the cache. If a duplicate is ever found, the re-transmitted request is subsequently dropped. However, in a failover scenario, if the back-up server gets the re-transmitted request, and there is no equivalent entry in its own reply cache to catch the duplication, the client may end up getting various error return codes (ESTALE, EPERM, etc.), even though the request has been carried out and succeeded on the original server.

To ease these inherent NFS protocol issues, we've tentatively identified several work items that we plan to address in the next few RHEL updates and/or releases. However, we've flagged the planned changes as private events in this bugzilla to avoid false expectations, mostly because the fixes are subject to the acceptance of the upstream Linux kernel community and overall RHEL product directions.

Before we complete the work, for NFS V2/V3, RHEL 4.4 has the following restrictions:

B-1: Unless NFS client applications can tolerate ESTALE and/or EPERM errors, I/O activities on the failover IP interface must be temporarily quiesced until the active-active failover transition completes. This is to avoid non-idempotent NFS operation failure on the new server. (Check out "Why NFS Sucks" by Olaf Kirch, placed as "kirch-reprint.pdf" in the 2006 OLS proceedings.)

B-2: With various possible base kernel bugs outside RHCS' control, there are possibilities that a local filesystem (such as ext3) umount could fail. To ensure data integrity, RHCS will abort the failover. The admin could specify the self-fence (reboot the taken-over server) option to force failover (via the cluster.conf file).

B-3: If an NFS client invokes an NLM locking call, the subject NFS servers (both the taken-over and take-over servers) will enter a global 90-second (tunable) locking grace period for every NFS service on the servers.

B-4: If NFS-TCP is involved, failover should not be issued on the same pair of machines multiple times within a 30-minute period; for example, failing over from node A to B, then immediately failing from B back to A, would hang the connection. This is to avoid the TCP TIME_WAIT issue.
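(As a hedged aside on B-4: before re-issuing a failover between the same pair of nodes, one simple sanity check is to confirm that no connections to the NFS port are still lingering in TIME_WAIT on either server; the interval and command below are illustrative only.)

    # Wait until no sockets on the NFS port (2049) remain in TIME_WAIT.
    while netstat -tan | awk '$4 ~ /:2049$/ && $6 == "TIME_WAIT"' | grep -q . ; do
        sleep 30
    done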
For NFS V4 failover, the issues were archived in:

http://sourceforge.net/mailarchive/forum.php?thread_id=26040366&forum_id=4930
and
http://sourceforge.net/mailarchive/forum.php?thread_id=26040369&forum_id=4930

The summary can be highlighted by NFS V4 developer J. Bruce Fields' reply, as follows:

<start quote>
> I have been working on setting up NFSv4 in cluster scenario. But when
> it comes to an Active-Active setup, it seems there are a few problems.

Yes. Even active-passive failover is problematic right now, mainly because the method we use for storing reboot recovery information (analogous to the state recorded by statd) is still in flux. We're aware of these problems and working on solving them. There are some rough (possibly out of date) notes here:

http://wiki.linux-nfs.org/index.php/Cluster_Coherent_NFS_design

but those are more of interest to someone working on the design of a solution than to someone trying to set up and use any of this (which we don't recommend doing yet).
<end quote>

Until the general NFS V4 dust settles for Linux, we don't plan to address NFS V4 failover issues in bugzilla 132823, in order to keep the problem in a manageable (and deliverable) state.

For people reading this bugzilla: the restrictions described in comment #47 assume there are active I/Os going on when failover occurs. If that is not the case, the bug should be investigated, instead of waiting for this bugzilla to get resolved.

Is this issue plaguing RHEL 5's version of NFS also?

Yes.

Nate indicates this isn't a blocker for 4.6. Moving to 4.7 and setting the blocker flag in the hopes we actually fix this. Maintaining QE ack.

What about RHEL5? Is this corrected in 5.1?

I need to set people's expectations. For NFS V3, there is a non-trivial amount of work involved that *does not* exist in the Linux OS today. This is part of the reason *why NFS was revised to V4*. One issue here is whether we should spend resources to speed up NFS V4 or spend time to work on V3 that may be get accepted upstream. And no, nothing is in RHEL5.

s/may be/may not/ in comment #68.

Need to resize this issue based on the current workload. Will update the issue by next Friday.

Had a very short discussion with Steve Dickson and re-visited action items A-1, A-2, A-3 (see comment #44). Will try to do some work on these three action items to improve V3 failover in the next updates (both RHEL4 and RHEL5):

A-1 is the most difficult one, since the 64K RPC cache needs to get copied between nodes. The issue here is how (via TCP/IP socket? via RPC? via RHCS rgmanager? via DLM? or via disk?) and how to get all parties (upstream, cluster group, NFS community, etc.) to accept the proposal. Will do some prototyping and kick off the discussion at the end of this month.

A-2 is doable.

A-3 is mostly VFS-layer work and needs heavy upstream involvement that would be very time consuming. But will try to get some prototype done and kick off a discussion by the end of the year.

There's been no update in almost a year - has anything happened on this upstream?