132823 – (RHEL4NFSFailover) RHEL4 U1: NFS cluster failover (hw)


Keywords:
Status:	CLOSED WONTFIX
Alias:	RHEL4NFSFailover
Product:	Red Hat Enterprise Linux 6
Classification:	Red Hat
Component:	kernel
Sub Component:
Version:	6.0
Hardware:	All
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	rc
Target Release:	---
Assignee:	Peter Staubach
QA Contact:	Corey Marthaler
Docs Contact:
URL:
Whiteboard:	Kernel
Duplicates (2):	166543 178057
Depends On:	166458 166701 167571 167572 175215 175229
Blocks:	139847 180185 430698

Reported:	2004-09-17 14:38 UTC by Tim Burke
Modified:	2009-09-30 17:42 UTC
CC List:	18 users
Fixed In Version:
Doc Type:	Enhancement
Doc Text:
Clone Of:
Environment:
Last Closed:	2009-09-30 17:42:44 UTC
Target Upstream Version:
Embargoed:

Attachments
NFS failover unit test script (requires distributed ssh keys) (5.24 KB, text/plain) 2004-12-02 17:43 UTC, Lon Hohberger
messages / sysrq-t output (48.32 KB, text/plain) 2005-12-16 19:01 UTC, Lon Hohberger

Description Tim Burke 2004-09-17 14:38:16 UTC

There are hooks in NFS that we added in both RHEL2.1 and RHEL3 which are needed
for NFS failover.  They are necessary to allow failover to occur without ESTALE
errors.

Comment 7 Lon Hohberger 2004-12-02 17:43:31 UTC

Created attachment 107780
NFS failover unit test script (requires distributed ssh keys)

Don't know if it helps, but here's the unit test I used for:

(1) Basic NFS mount/umount test of cluster NFS export
(2) Normal I/O to clustered NFS export
(3) Normal I/O during restart of cluster NFS export
(4) Normal I/O during relocate of cluster NFS export
(5) Normal I/O during failover of cluster NFS export (by rebooting the active
node)

You have to customize all the variables at the top of the script and distribute
ssh keys (from the client to the servers).  I can help carve out a cluster
if/when necessary

Comment 17 Lon Hohberger 2005-09-14 16:34:52 UTC

Adding all the currently open bugzillas related to NFS behavior during/after
relocation/failover on RHEL4

Comment 20 Kiersten (Kerri) Anderson 2005-10-20 20:51:12 UTC

*** Bug 166543 has been marked as a duplicate of this bug. ***

Comment 22 Lon Hohberger 2005-12-05 19:19:52 UTC

Adding alias.

Comment 23 Lon Hohberger 2005-12-05 20:22:28 UTC

Based on things I tested earlier today, RHEL4 may not need rmtab maintenance at
all, in GFS or non-GFS exports.  We're looking into it more.

Comment 24 Lon Hohberger 2005-12-07 20:15:21 UTC

* RHEL4 does not need rmtab maintenance like RHEL3.
* rgmanager should have inheritable fsid tags to make configuration easier
(rather than having to do fsid=x for every NFS client ... :( )
* Added relevant fsid inheritance bugzilla to this for tracking.

Comment 25 Lon Hohberger 2005-12-07 22:33:40 UTC

test-script-blue performs 5 basic sanity tests with both TCP & UDP for nfsv3
exports in Red Hat Cluster Suite.

* mount, umount
* mount, perform some I/O, unmount
* mount, restart service on same node while doing I/O, unmount
* mount, relocate service to different node while doing I/O, unmount
* mount, obliterate server running service (reboot -fn), causing a hard
failover, unmount

While the I/O is running, it should not return an I/O error, ESTALE, EBUSY,
EPERM, etc...
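
The pass criterion above can be sketched in Python.  This is not the attached
test script; it is only a minimal illustration.  The function name io_loop, the
4 KB block size and the exact errno set are assumptions made for the example.

```python
import errno
import os

# Errors that must never surface to the client application during the test.
FORBIDDEN = {errno.EIO, errno.ESTALE, errno.EBUSY, errno.EPERM}


def io_loop(path, iterations=100, block=b"x" * 4096):
    """Append blocks to `path` and return the names of forbidden errors seen."""
    failures = []
    for _ in range(iterations):
        try:
            with open(path, "ab") as f:
                f.write(block)
                f.flush()
                os.fsync(f.fileno())  # push the data to the server
        except OSError as e:
            if e.errno in FORBIDDEN:
                failures.append(errno.errorcode[e.errno])
            else:
                raise  # anything else is unexpected; do not hide it
    return failures
```

Run against a file on the NFS mount while the service is restarted, relocated
or failed over; an empty list corresponds to a pass for that step.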

Here's a test matrix running it on a RHCS4 cluster with GFS and ext3-based NFS
exports.  Either it passes all 5 tests as noted above, or it fails.  Note that
the sleep-hack (a hack which lets nfsd hopefully clear its queue of pending
requests by sleeping for 10 seconds) is present.

              RHEL3     RHEL4
ext3 export   Pass      Pass
gfs export    Pass      Pass

In all cases, the client took a long time (5-15 minutes) to recover.  AFAIK,
this is not a cluster problem; the retry time increases with each successive
failure with NFS.

When using RHEL4 as a client, the CPU got pegged with ksoftirqd after about 4-5
minutes - it seems like when nfs is in retry-loop it enters some infinite loop
condition after that time.  What is weird is that the NFS service has failed
over and is mountable by other clients long before the waiting client enters
this weird state.

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND           
    3 root      39  19     0    0    0 S 99.9  0.0  12:37.34 ksoftirqd/0  

After about 10-11 minutes of being CPU-pegged, the NFS client recovers and life
goes on.  This behavior does not happen with a RHEL3 client.  This could be
tunable; I don't know, but CPU-pegged seems bad...

Comment 26 Lon Hohberger 2005-12-07 22:41:49 UTC

Other thing to note: the *failover* test is only done using TCP, and there's a
15-20 second failover time, and this test is the only one which takes a really
long time (>5 minutes) to complete.  I could switch to udp, but I think we care
most about tcp.  Steve can make that call, in any case.

Comment 27 Steve Dickson 2005-12-09 11:14:30 UTC

Could you please post a sysrq-t system backtrace and sysrq-p backtrace
of the cpus... when the CPU gets pegged...

With the >5 min delay a sysrq-t would be good and an ethereal dump
to see if there is any NFS traffic going over the wire would also
be good...

Comment 28 Thorsten Scherf 2005-12-14 14:20:34 UTC

the section in question looks like this:

Section "Device"
        Identifier  "Videocard0"
        Driver      "i810"
        VendorName  "Videocard vendor"
        BoardName   "Intel 810"
EndSection

Comment 29 Thorsten Scherf 2005-12-14 14:25:06 UTC

ignore last comment, wrong bz id

Comment 30 Lon Hohberger 2005-12-16 17:36:29 UTC

Steve -- I'll get you the stack trace today.

Comment 31 Lon Hohberger 2005-12-16 19:01:23 UTC

Created attachment 122341
messages / sysrq-t output

Comment 32 Lon Hohberger 2005-12-16 19:05:42 UTC

Ok, I have a tcpdump, but it's like 400MB, so I'm not going to attach it.

I started the tcpdump as soon as my service-relocate completed, and hit ctrl-C
on my application so that it would stop as soon as it came out of disk-wait:

Fri Dec 16 13:45:32 EST 2005
13:45:32.336938 IP red.lab.boston.redhat.com.799 > 192.168.79.11.nfs: . ack
2110919727 win 1728 <nop,nop,timestamp 848259206 78019328>
13:45:32.338870 IP 192.168.79.11.nfs > red.lab.boston.redhat.com.799: . ack
4166286499 win 16022 <nop,nop,timestamp 270554560 847653134>
13:45:32.338899 IP red.lab.boston.redhat.com.799 > 192.168.79.11.nfs: . ack 1
win 1728 <nop,nop,timestamp 848259206 78019328>
13:45:32.338919 IP 192.168.79.11.nfs > red.lab.boston.redhat.com.799: . ack
4166286499 win 16022 <nop,nop,timestamp 270554560 847653134>
13:45:32.338938 IP red.lab.boston.redhat.com.799 > 192.168.79.11.nfs: . ack 1
win 1728 <nop,nop,timestamp 848259206 78019328>

...

It recovered at this point:

14:02:32.529162 IP red.lab.boston.redhat.com.8092349 > 192.168.79.11.nfs: 1448
write [|nfs]

Could these be some of the dup-ack problems?

Comment 39 Alex Samad 2006-04-07 04:57:40 UTC

Hi

I was wondering if there has been any resolution to this ? 

Thanks

Comment 44 Wendy Cheng 2006-08-02 19:35:43 UTC

To provide minimum NFS failover functionality with NFS V2/V3, the following
is the tentative (base kernel) work item list:

A-1: NFSD request reply cache gets copied from taken-over server into
     new server upon failover. Cache size = 1024 * struct svc_cacherep
     (64K maximum). (both upstream and RHEL kernels).
A-2: Allow RPC layer to close (TCP) socket connection without going into
     TIME_WAIT state. (RHEL kernel only).
A-3: Allow umounting a filesystem (via kernel force umount call) regardless
     of open file reference counts. This is to immunize failover against the
     ever-possible kernel and/or filesystem bugs that somehow leave file
     reference counts dangling. This feature (todo item) was mentioned at
     the 2006 Linux Kernel Summit (http://lwn.net/Articles/191926/).
     (both upstream and RHEL kernels).
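
A-2 is about the in-kernel RPC layer, so it cannot be shown directly from user
space.  As a rough analogy only, a user-space TCP socket can skip TIME_WAIT on
close by enabling SO_LINGER with a zero timeout, which makes close() reset the
connection.  The helper names below are hypothetical, and the two-int layout of
struct linger is assumed (as on Linux).

```python
import socket
import struct


def enable_abortive_close(sock):
    """Make close() reset the connection instead of closing it gracefully."""
    # l_onoff=1, l_linger=0: close() discards unsent data and sends RST,
    # so this end of the connection does not sit in TIME_WAIT.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER,
                    struct.pack("ii", 1, 0))


def linger_settings(sock):
    """Return (enabled, seconds) as currently configured on the socket."""
    raw = sock.getsockopt(socket.SOL_SOCKET, socket.SO_LINGER,
                          struct.calcsize("ii"))
    onoff, seconds = struct.unpack("ii", raw)
    return bool(onoff), seconds
```

The trade-off is that an abortive close throws away any data still queued on
the socket, so it only suits connections that are being torn down anyway.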

RHEL 4.4 NFS failover restrictions
----------------------------------
B-1: Unless NFS client applications can tolerate ESTALE and/or EPERM errors,
     IO activities on the failover ip interface must be temporarily quiesced
     until active-active failover transition completes. This is to avoid
     non-idempotent NFS operation failure on the new server. (check out 
     "Why NFS Sucks" by Olaf Kirch, placed as "kirch-reprint.pdf" in 2006
     OLS proceeding from https://ols2006.108.redhat.com/). 
B-2: With various possible base kernel bugs outside RHCS' control, there 
     are possibilities that local filesystem (such as ext3) umount could 
     fail. To ensure data integrity, RHCS will abort the failover. Admin 
     could either specify the self-fence (reboot taken-over server) option 
     to force failover (via cluster.conf file) or re-mount the filesystem 
     on the taken-over server as ro (read-only) to allow failover. Both 
     options have the possibility of losing data. (side note: not sure 
     whether we could do re-mount as read-only in user space - need to 
     check). This restriction doesn't apply to GFS cluster filesystem.
B-3: If nfs client invokes NLM locking call, the subject nfs servers (both
     taken-over and take-over) will enter a global 90-second (tunable) 
     locking grace period for every nfs service on the servers.
B-4: If NFS-TCP is involved, failover should not be issued on the same pair 
     of machines multiple times within 30-minute period; for example, 
     failing over from node A to B, then immediately failing from B back to 
     A would hang the connection. This is to avoid TCP TIME_WAIT issue.
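
Restriction B-4 could be enforced by an admin-side check along these lines.
This is a sketch, not something RHCS provides: failover_allowed and the history
tuple format are invented for the example; only the 30-minute window comes from
B-4.

```python
from datetime import datetime, timedelta

# Minimum spacing between failovers on the same pair of nodes (from B-4).
MIN_GAP = timedelta(minutes=30)


def failover_allowed(history, src, dst, now):
    """Return False if src/dst failed over (either direction) within MIN_GAP.

    `history` is a list of (timestamp, from_node, to_node) tuples.
    """
    pair = frozenset((src, dst))
    for when, a, b in history:
        if frozenset((a, b)) == pair and now - when < MIN_GAP:
            return False
    return True
```

With a recorded A-to-B failover, a B-to-A request ten minutes later is refused,
while the same request after the window, or a failover to a third node, is
allowed.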

RHEL 4.5 and/or RHEL 5.1 Improvement
-------------------------------------
C-1: Implement A-1 but subject to upstream acceptance.
C-2: Improve forced ro-remount (B-2) so it is less likely to lose data. Also
     subject to upstream acceptance.
C-3: B-3 issue - subject to the acceptance of the patches submitted in:
     https://www.redhat.com/archives/cluster-devel/2006-August/msg00000.html.
C-4: A-3 issue - backport the upstream changes into RHEL with restrictions 
     (may not work well if NFS client and servers are across firewall).
                                                                                
Long Term Items:
---------------
D-1: Implement A-3 to lift most of the failover issues.
D-2: Note that NFS RPC reply cache is not fool-proof. May need to study
     filesystem specific (particularly GFS) feature (such as piggy-back
     state information into NFS filehandle) to allow error free failover.

Comment 46 Wendy Cheng 2006-08-03 18:32:45 UTC

*** Bug 178057 has been marked as a duplicate of this bug. ***

Comment 47 Wendy Cheng 2006-08-03 19:25:11 UTC

Status Summary:

With the current state of Linux kernels (both RHEL and upstream), NFS failovers
have been error-prone. An example is the "non-idempotent" issue associated
with operations such as "rm" or "rename". This could start with the (nfs) client
sending in a file remove request ("rm"). The (nfs) server passes the call
into the filesystem and somehow gets stuck there for a while (say, waiting for a
directory lock). A timeout occurs on the client side and a re-transmit happens.
The current Linux NFSD is designed to handle this issue with a global "request
reply cache", where each request is checked against the entries in the cache.
If a duplicate is found, the re-transmitted request is dropped.
However, in a failover scenario, if the back-up server gets the re-transmitted
request and there is no equivalent entry in its own reply cache to
protect against this duplication, the client may end up getting various error
return codes (ESTALE, EPERM, etc), regardless of whether the request has
already been carried out and succeeded on the original server.
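
The scenario can be illustrated with a toy model in Python.  This is not how
NFSD is implemented: ToyNfsServer, the (client, xid) cache key and the
"OK"/"DROPPED"/"ERROR" results are assumptions made for the sketch, with
"ERROR" standing in for whatever error code the client would actually see.

```python
from collections import OrderedDict


class ToyNfsServer:
    """One server node with its own reply cache over shared storage."""

    def __init__(self, shared_files, cache_size=1024):
        self.files = shared_files          # set shared by all nodes
        self.reply_cache = OrderedDict()   # (client, xid) -> result
        self.cache_size = cache_size

    def handle_remove(self, client, xid, name):
        key = (client, xid)
        if key in self.reply_cache:
            return "DROPPED"               # known duplicate: ignore it
        if name in self.files:
            self.files.remove(name)
            result = "OK"
        else:
            result = "ERROR"               # file is already gone
        self.reply_cache[key] = result
        if len(self.reply_cache) > self.cache_size:
            self.reply_cache.popitem(last=False)  # evict the oldest entry
        return result
```

With two nodes sharing one file set, the original server removes the file and
drops a re-transmit of the same request, but the back-up server, whose cache is
empty, executes the re-transmit again and returns an error for an operation
that already succeeded.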

To ease these inherent NFS protocol issues, we've tentatively identified
several work items that we plan to address in the next few RHEL updates and/or
releases. However, we've flagged the planned changes as private events in
this bugzilla to avoid false expectations, mostly because the fixes are
subject to acceptance by the upstream Linux kernel community and RHEL overall
product direction.

Before we complete the work, for NFS v2/V3, RHEL 4.4 has the following 
restrictions:  

B-1: Unless NFS client applications can tolerate ESTALE and/or EPERM errors,
     IO activities on the failover ip interface must be temporarily quiesced
     until active-active failover transition completes. This is to avoid
     non-idempotent NFS operation failure on the new server. (check out
     "Why NFS Sucks" by Olaf Kirch, placed as "kirch-reprint.pdf" in 2006
     OLS proceeding). 
B-2: With various possible base kernel bugs outside RHCS' control, there
     are possibilities that local filesystem (such as ext3) umount could
     fail. To ensure data integrity, RHCS will abort the failover. Admin
     could specify the self-fence (reboot taken-over server) option
     to force failover (via cluster.conf file). 
B-3: If nfs client invokes NLM locking call, the subject nfs servers (both
     taken-over and take-over) will enter a global 90-second (tunable)
     locking grace period for every nfs service on the servers.
B-4: If NFS-TCP is involved, failover should not be issued on the same pair
     of machines multiple times within 30-minute period; for example,
     failing over from node A to B, then immediately failing from B back to
     A would hang the connection. This is to avoid TCP TIME_WAIT issue.

For NFS V4 failover, the issues were archived in:
                                                                                
http://sourceforge.net/mailarchive/forum.php?thread_id=26040366&forum_id=4930
and
http://sourceforge.net/mailarchive/forum.php?thread_id=26040369&forum_id=4930
                                                                                
The summary can be highlighted by NFS V4 developer J. Bruce Fields' reply as 
the following:
                                                                                
<start quote>
> I have been working on setting up NFSv4 in cluster scenario. But when
 > it comes to an Active-Active setup, it seems there are a few problems.
                                                                                
 Yes.  Even active-passive failover is problematic right now, mainly
 because the method we use for storing reboot recovery information
 (analagous to the state recorded by statd) is still in flux.
                                                                                
 We're aware of these problems and working on solving them.  There are
 some rough (possibly out of date notes) here:
                                                                                
 http://wiki.linux-nfs.org/index.php/Cluster_Coherent_NFS_design
 but those are more of interest to someone working on design of a
 solution than to someone trying to set up and use any of this (which we
 don't recommend doing yet).
                                                                                
<end quote>

Until the general NFS V4 dust settles for Linux, we don't plan to address
NFS V4 failover issues in bugzilla 132823, in order to keep the problem in a
manageable (and deliverable) state.

Comment 57 Wendy Cheng 2006-09-20 17:36:28 UTC

For people reading this bugzilla,

The restrictions described in comment #47 assume there are active IOs going on
when failover occurs. If that is not the case, the bug should be investigated
instead of waiting for this bugzilla to get resolved.

Comment 63 Eugene Tay 2007-06-12 18:40:30 UTC

Is this issue plaguing RHEL 5's version of NFS also?

Comment 64 Wendy Cheng 2007-06-12 18:44:23 UTC

Yes.

Comment 66 Jay Turner 2007-09-05 18:06:19 UTC

Nate indicates this isn't a blocker for 4.6.  Moving to 4.7 and setting the
blocker flag in the hopes we actually fix this.  Maintaining QE ack.

Comment 67 Rafael Godínez Pérez 2007-10-15 01:00:13 UTC

What about RHEL5?
Is this corrected in 5.1?

Comment 68 Wendy Cheng 2007-10-15 02:56:06 UTC

I need to set people's expectations. For NFS V3, there is a non-trivial amount
of work involved that *does not* exist in the Linux OS today. This is part of
the reason *why NFS is revised to V4*.

One issue here is whether we should spend resources to speed up NFS V4 or spend
time to work on V3 that may be get accepted upstream.

And no, nothing is in RHEL5.

Comment 69 Wendy Cheng 2007-10-15 02:57:19 UTC

s/may be/may not/ in comment #68.

Comment 70 Wendy Cheng 2007-10-15 03:00:36 UTC

Need to resize this issue based on current workload. Will update the
issue by next Friday.

Comment 71 Wendy Cheng 2007-10-19 15:51:17 UTC

Had a very short discussion with Steve Dickson and revisited action
items A-1, A-2, A-3 (see comment #44). Will try to do some work on
these three action items to improve V3 failover in the next updates (both
RHEL4 and RHEL5):

A-1 is the most difficult one, since the 64K RPC cache needs to get
copied between nodes. The issue here is how (via TCP/IP socket? via
RPC? via RHCS rgmanager? via DLM? or via disk?) and how to get
all parties (upstream, cluster group, NFS community, etc) to accept the
proposal. Will do some prototyping and kick off the discussion at the end
of this month.

A-2 is doable.

A-3 is mostly VFS layer work and needs heavy upstream involvement, which
would be very time-consuming. But will try to get a prototype done
and kick off a discussion by the end of the year.

Comment 74 Bill Nottingham 2008-10-09 18:53:07 UTC

There's been no update in almost a year - has anything happened on this upstream?
