Bug 1148619
| Summary: | moving NFSv4 export causes client mounts to fail | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | David Vossel <dvossel> |
| Component: | nfs-utils | Assignee: | J. Bruce Fields <bfields> |
| Status: | CLOSED WONTFIX | QA Contact: | JianHong Yin <jiyin> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 7.1 | CC: | bcodding, bfields, dvossel, eguan, fdinitto, jruemker, mnovacek, ovasik, sbradley, swhiteho |
| Target Milestone: | rc | Keywords: | TestBlocker |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2015-01-07 20:14:20 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | | | |
The filehandle has a part that identifies the filesystem and a part that identifies the particular file within that filesystem.
The usual explanation for this kind of failure is that the part identifying the filesystem changed between servers. Using the 'fsid=' option consistently on the exports is one way to avoid this (though on a modern system it should default to identifying the filesystem by uuid, which should also remain the same across servers).
On each server, after a mount attempt (successful or unsuccessful), could you collect the output of:
exportfs -v
cat /proc/net/rpc/*/content
That should help figure out how the servers are identifying the exported filesystems.
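To make the 'fsid=' suggestion concrete, here is a sketch of an exports file that pins the filesystem identifiers explicitly, kept identical on every server in the HA pair. The paths, client range, and fsid values below are hypothetical illustrations, not taken from this cluster:

```
# /etc/exports -- kept byte-identical on both HA servers (values hypothetical)
/mnt/exports           192.168.122.0/24(rw,sync,fsid=0,crossmnt)
/mnt/exports/export1   192.168.122.0/24(rw,sync,fsid=1)
/mnt/exports/export2   192.168.122.0/24(rw,sync,fsid=2)
```

With explicit fsid= values like these, the filesystem part of the filehandle no longer depends on the underlying device, which can differ between servers.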
(In reply to J. Bruce Fields from comment #2)
> The filehandle has a part that identifies the filesystem and a part that
> identifies the particular file within that filesystem.
>
> The usual explanation for this kind of failure is that the part identifying
> the filesystem changed between servers. Using the 'fsid=' option

here's a thought: the exports that are movable between servers exist on shared storage. Their fsid values and their physical storage device never change. The root export (fsid=0) will always have the same fsid, but its physical storage device will be different on every server. The root filesystem for the exports is unique to every server, whereas the non-root filesystems are completely consistent between servers.

> consistently on the exports is one way to avoid this (though on a modern
> system it should default to identifying the filesystem by uuid, which should
> also remain the same across servers).

Looking at the man page for exports, the fsid should uniquely identify the export, not the UUID of the filesystem. Perhaps there is a problem here?

I have another data point that seems to indicate this is where the problem is. When I place the root export on shared storage and migrate the root export with the other shared exports, everything is fine... So, if the root export is physically the same device, we have no problem after the move. If the root export's device is unique between servers, clients fail after the shared exports move.

Again, this same scenario that fails in rhel7 works in rhel6.

> On each server, after a mount attempt (successful or unsuccessful), could
> you collect the output of:

The mounts are always successful.

> exportfs -v
> cat /proc/net/rpc/*/content
>
> That should help figure out how the servers are identifying the exported
> filesystems.

Yep, I can gather this.

-- David

"I have another data point that seems to indicate this is where the problem is. When I place the root export on shared storage and migrate the root export with the other shared exports, everything is fine... So, if the root export is physically the same device, we have no problem after the move. If the root export's device is unique between servers, clients fail after the shared exports move."

That's interesting! In that case maybe comment #2 is wrong: it may not be the filesystem part of the filehandle that's the problem, but instead the part that identifies the actual file.

It'd still be useful to get that export information just to confirm there's no problem there.

"again, this same scenario that fails in rhel7 works in rhel6."

But I don't have an explanation for this.

Are you setting up the pseudofilesystem explicitly with an fsid=0 export in all of these tests? If so, I wonder if there's something different about how you're creating that filesystem in the rhel6 and rhel7 cases.

What are the inode numbers of the directories in the "pseudofilesystem"? Is it possible that in the rhel6 case they were created in some way that happened to give the corresponding directories the same inode numbers, but in the rhel7 case they weren't?

But if it's the pseudofilesystem that's the issue, then there's one more thing I don't understand: the client normally only uses the pseudofilesystem at mount time. Is the client actually mounting / itself?

There might also be a race condition here where the mount might fail if the failover happened while a mount was in progress--I'll look into that--but I'd expect that to be harder to reproduce, so not what you're seeing here.

(In reply to J. Bruce Fields from comment #5)
> Are you setting up the pseudofilesystem explicitly with an fsid=0 export in
> all of these tests?

yes, the setup is identical. This test is set up using a template scenario file that runs the exact same way on both rhel7 and rhel6. https://github.com/davidvossel/phd/blob/master/scenarios/nfs-active-active.scenario

> If so, I wonder if there's something different about how you're creating
> that filesystem in the rhel6 and rhel7 cases.
>
> What are the inode numbers of the directories in the "pseudofilesystem"? Is
> it possible that in the rhel6 case they were created in some way that
> happened to give the corresponding directories the same inode numbers, but
> in the rhel7 case they weren't?

you are onto something here. For rhel6 the fsid=0 directory does have the same inode on both nodes. For rhel7 they are different.

[root@rhel6-auto1 ~]# ls -i /mnt/
134506 exports

[root@rhel6-auto2 ~]# ls -i /mnt/
134506 exports

[root@rhel7-auto1 ~]# ls -i /mnt/
1248784 exports  17323771 gfs2share

[root@rhel7-auto2 ~]# ls -i /mnt/
1257550 exports  18886413 gfs2share

This is concerning. If this is in fact the problem, what options do we have to make this reliable?

-- David

(In reply to J. Bruce Fields from comment #6)
> But if it's the pseudofilesystem that's the issue, then there's one more
> thing I don't understand: the client normally only uses the pseudofilesystem
> at mount time.
>
> Is the client actually mounting / itself?

no, the client is external to all the nfs servers.

> There might also be a race condition here where the mount might fail if the
> failover happened while a mount was in progress--I'll look into that--but
> I'd expect that to be harder to reproduce, so not what you're seeing here.

good to look into, but that's not happening here

Created attachment 943572 [details]
exportfs debug
I've uploaded the debug you asked for.
cat /proc/net/rpc/*/content
exportfs -v
The txt file shows the state of the system before exports move between servers and after. I've included output from both rhel6 and rhel7.
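As background for why the differing inode numbers matter, the filehandle layout described in comment #2 (a filesystem part plus a file part) can be pictured with a toy model. This is a deliberate simplification, not the real knfsd filehandle encoding; only the inode numbers are taken from the `ls -i` output earlier in this bug.

```python
# Toy model of an NFS filehandle as (filesystem part, file part).
# NOT the real knfsd encoding -- just an illustration of why handles
# minted on one server stop resolving on a server whose fsid=0
# directory has a different inode number.

def make_handle(fsid, inode):
    """A filehandle identifies a filesystem and a file within it."""
    return (fsid, inode)

def lookup(server_inodes, handle):
    """Resolve a handle to a path, or 'ESTALE' if it no longer resolves."""
    path = server_inodes.get(handle)
    return path if path is not None else "ESTALE"

# Inode numbers observed in this bug: the fsid=0 directory differs
# between the two rhel7 nodes (1248784 vs. 1257550).
rhel7_node1 = {(0, 1248784): "/mnt/exports"}
rhel7_node2 = {(0, 1257550): "/mnt/exports"}

handle = make_handle(0, 1248784)    # minted on node1 before failover
print(lookup(rhel7_node1, handle))  # /mnt/exports
print(lookup(rhel7_node2, handle))  # ESTALE after failover to node2
```

In the rhel6 setup the fsid=0 directory happened to get inode 134506 on both nodes, so the same lookup keeps working after failover, which is why the regression only shows up on rhel7.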
(In reply to J. Bruce Fields from comment #6)
> There might also be a race condition here where the mount might fail if the
> failover happened while a mount was in progress--I'll look into that--but
> I'd expect that to be harder to reproduce, so not what you're seeing here.

(In reply to David Vossel from comment #8)
> (In reply to J. Bruce Fields from comment #6)
> > Is the client actually mounting / itself?
>
> no, the client is external to all the nfs servers.

What I meant was: is it doing something like

mount server:/ /mnt/

? I don't completely understand this snippet from the file you linked:

pcs -f $tmpfile resource create nfs-client-v4 Filesystem device=${PHD_ENV_floating_ips1}:/ directory=/nfsclientv4 fstype=nfs4

but it looks like it might be? I think I do understand why that wouldn't always work.

> > There might also be a race condition here where the mount might fail if
> > the failover happened while a mount was in progress--I'll look into
> > that--but I'd expect that to be harder to reproduce, so not what you're
> > seeing here.
>
> good to look into, but that's not happening here

I talked to Trond; unfortunately I think it's likely we do have a problem here that we'll have to figure out how to fix.

So I think the situation for now is:

- we should tell people doing HA to mount individual exports, not /.
- and we might need to warn them that there could be temporary mount failures if the mount happens at the same time as failover.

Thanks for the export information. I haven't had a chance to look that over yet, but I think we may understand the issue now.

In both rhel6 and rhel7 cases, on the target server after the failover /proc/net/rpc/nfsd.fh/content has only:

192.168.122.0/255.255.255.0 1 0x00000000 /mnt/exports

which means that's the only export the client attempts to access after the failover. That looks consistent with the client mounting only floating-ip:/ (== /mnt/exports).

So, I think this is expected, and we need to either tell people to mount only the "real" exports underneath, or we need to arrange for the filesystems on the two nodes to be identical clones of each other.

The drawback of the first approach is that the problem can still happen when a mount happens at the same time as failover.

The second approach is more complicated. You can create a small filesystem to use as the pseudofilesystem and copy the filesystem image to both servers. The extra step is annoying, but perhaps we could provide some help. Another problem is that using fsid=0 causes v3 and v4 clients to see different paths (e.g. v3 clients have to mount floating-ip:/mnt/exports/export1 to get the same filesystem that v4 clients mount using floating-ip:/export1). We could fix that with some changes to mountd.

You can leave out the fsid=0 export entirely, and recent rhel6 and rhel7 will construct a pseudofilesystem automatically, giving consistent paths with v3 and v4 (so both could mount /mnt/exports/export1 in the above example). However, the pseudofilesystem in that case is just a tightly restricted export of the server's actual /, so for example filehandles are constructed using inode numbers from the exported filesystem. Arranging for the root filesystem on both servers to be identical sounds like more of a pain. Perhaps we could find a different way to construct that pseudofilesystem.

(In reply to J. Bruce Fields from comment #10)
. . .
> I talked to Trond; unfortunately I think it's likely we do have a problem
> here that we'll have to figure out how to fix.
>
> So I think the situation for now is:
>
> - we should tell people doing HA to mount individual exports, not /.

This definitely gets us closer. I tested mounting using the individual exports, and failover does not cause a stale file handle issue on the client.

Now that the client can maintain access to the mount after the failover, I've encountered another issue I can't explain. Part of our regression tests is for the client to hold a file lock on the nfs mount, and then to perform a server-side export failover while the lock is still being held. For the rhel7 test setup, the lock reclaim fails for the NFSv4 mount:

kernel: NFS: nfs4_reclaim_open_state: Lock reclaim failed!

I performed the same test using rhel6, and the lock reclaim happens after the failover as expected. On both rhel7 and rhel6 I'm mounting the client using the specific export and not the root export, as we discussed:

# mount -v -o "vers=4" 192.168.122.200:/export1 /nfsclientv4

From there I grab a lock on a file within the export using the fcntl lock util I've created here, https://raw.githubusercontent.com/davidvossel/phd/master/misc/fnctl_locker.c

# fnctl_locker -f /nfsclientv4/clientsharefile

After the lock is held, I fail over export1 from one server to another. rhel6 works, rhel7 doesn't. Any ideas?

-- David

(In reply to David Vossel from comment #12)
> After the lock is held, I fail over export1 from one server to another.
> rhel6 works, rhel7 doesn't. Any ideas?

I guess the most likely reasons are that the target wasn't in a grace period, or that the reboot-recovery information didn't get moved over, so the server doesn't think this client is allowed to reclaim.

Oh, wait, I bet that's it: do the rhel7 servers have a /var/lib/nfs/nfsdcltrack/main.sqlite file? We switched to a different system for tracking the reboot recovery information, and I can't remember if that change made it to rhel7--probably so. So I bet we just need to copy over that nfsdcltrack directory on failover.
(I'd recommend also continuing to copy /var/lib/nfs/nfsdcltrack/, just in case it's needed for backwards compatibility, but it will probably normally be empty (or even absent).)

(In reply to J. Bruce Fields from comment #13)
> Oh, wait, I bet that's it: do the rhel7 servers have a
> /var/lib/nfs/nfsdcltrack/main.sqlite file?

yes, the file is present on the rhel7 cluster. The file has content in it as well, indicating that it is used.

> We switched to a different system for tracking the reboot recovery
> information, and I can't remember if that change made it to rhel7--probably
> so. So I bet we just need to copy over that nfsdcltrack directory on
> failover.

I'm thinking it likely isn't that simple. We need to dynamically move exports between multiple active servers. An export might move from server A to server B. That doesn't mean that all the exports moved from server A, though. That also doesn't mean that server B wasn't already hosting exports before the move occurred. If the main.sqlite file is required to track which clients can reclaim which locks, we actually have to merge the contents of one server's main.sqlite file with the contents of another. Do you have any thoughts on this?

> (I'd recommend also continuing to copy /var/lib/nfs/nfsdcltrack/, just in
> case it's needed for backwards compatibility, but it will probably normally
> be empty (or even absent).)

This bug has gotten a bit complicated. I believe there are three independent issues remaining:

- clients mounting the server's "/" can get ESTALE after failover. Workaround is not to mount "/".
- clients mounting could get ESTALE if they happen to mount at exactly the moment failover is in progress. Workaround is probably to retry the mount. I believe there is actually some logic in the VFS to retry at least one ESTALE, so this may be hard to hit.
- we depend on being able to merge the list of clients on failover.

The first two issues really have the same root cause. My impression is that they're lower priority, but that the third is still a real obstacle?

The third, I think, can be fixed for now in the 4.0-only case by a simple program that merges the two tables, throwing out any duplicates (possibly with a warning, since I don't believe there should be any duplicates in the 4.0 case).

NFSv4.0 clients use a different client identifier for each distinct server IP address. Clients using version >=4.1 use the same client identifier to different servers. I think that may cause clients to be permitted to reclaim locks in some cases where they shouldn't, but I need to write out the case to be sure. For now we could disable >=4.1 on the server side. I think that's doable for 7.1.

I'd just like to make sure we're in agreement on comment 16 & 17: so for now are you OK with:

- mounting only "real" exports, not "/"? (And maybe dealing with the first two issues from comment 16 in a separate lower-priority bug.)
- possibly restricting to 4.0 and disabling NFS versions >=4.1?

(In reply to J. Bruce Fields from comment #17)
> NFSv4.0 clients use a different client identifier for each distinct server
> IP address.
>
> Clients using version >=4.1 use the same client identifier to different
> servers.

I forgot that 4.0 clients may behave this way too, though it shouldn't be the default; see the "migration/nomigration" options in the nfs man page (which warns that "Some server features misbehave" if it's changed from the default), and see the (rather verbose) https://tools.ietf.org/html/draft-ietf-nfsv4-rfc3530-migration-update-05.

> I think that may cause clients to be permitted to reclaim locks in some
> cases where they shouldn't, but I need to write out the case to be sure.

Somewhat contrived, but:

- Client A mounts exports from servers B and C.
- B and C are rebooted, and simultaneously A loses contact with B.
- Client A successfully reclaims locks on C, but B's grace period expires without A being able to reclaim locks on B.
- We move exports from C to B. Simultaneously, the network between A and B comes back up.

At this point, the merged view of B and C's client records makes it appear that A should be able to reclaim any locks it wants to. But it should really only be permitted to reclaim locks on the exports that migrated from C.

(In reply to J. Bruce Fields from comment #18)
> I'd just like to make sure we're in agreement on comment 16 & 17: so for
> now are you OK with:
> - mounting only "real" exports, not "/"? (And maybe dealing with the
>   first two issues from comment 16 in a separate lower-priority bug.)

yes. Based on the discussion in this issue, not mounting "/" in this scenario is reasonable.

> - possibly restricting to 4.0 and disabling NFS versions >=4.1?

Let's track the NFSv4 client state merge discussion in a separate issue. I'm not confident we'll be able to solve that for 7.1.

(In reply to David Vossel from comment #20)
> Let's track the NFSv4 client state merge discussion in a separate issue.
> I'm not confident we'll be able to solve that for 7.1.
OK, based on this I'm guessing we should close this bug, and later open some more specific set of bugs for the more specific issues we've identified (at a minimum, for the client-record merging).
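The "simple program that merges the two tables, throwing out any duplicates" discussed above might look roughly like the following sketch. The table name and column layout of nfsdcltrack's main.sqlite are assumptions for illustration here (a "clients" table keyed by client id) and should be checked against the real schema before using anything like this in a failover agent.

```python
# Sketch (NOT the shipped nfsdcltrack tooling): merge one server's
# client-tracking records into another's on failover, skipping
# duplicates. The schema is an assumption for illustration.
import sqlite3

def merge_client_records(dst_db, src_db):
    """Copy client records from src_db into dst_db; return final row count."""
    con = sqlite3.connect(dst_db)
    try:
        con.execute("ATTACH DATABASE ? AS src", (src_db,))
        before = con.execute("SELECT COUNT(*) FROM clients").fetchone()[0]
        # INSERT OR IGNORE drops rows whose primary key already exists,
        # i.e. duplicates are thrown out, as comment 17 suggests.
        con.execute("INSERT OR IGNORE INTO clients SELECT * FROM src.clients")
        con.commit()
        after = con.execute("SELECT COUNT(*) FROM clients").fetchone()[0]
        src_total = con.execute("SELECT COUNT(*) FROM src.clients").fetchone()[0]
        dupes = src_total - (after - before)
        if dupes:
            # Comment 17 notes duplicates shouldn't occur in the 4.0 case,
            # so surface them rather than merging silently.
            print("warning: %d duplicate client record(s) skipped" % dupes)
        return after
    finally:
        con.close()
```

As comments 17-19 point out, a merge like this is only safe for NFSv4.0 clients (which use a distinct client identifier per server IP); for >=4.1 clients the merged view can wrongly permit reclaims, which is why that discussion moved to a separate issue.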
Created attachment 943207 [details]
Active/Active HA-NFS use-case presentation

Description of problem:

We need to be able to support the HA NFSv4 Active/Active use-case, where NFSv4 exports move between two active servers with limited interruption on the client side. I have attached a PDF slide presentation that outlines exactly how this use-case works.

Right now the Active/Active use-case appears to work as expected on rhel6. On rhel7, however, we are seeing errors that cause the clients to experience 'stale filehandle' errors after an export relocates. This indicates a regression between rhel6 and rhel7.

In /var/log/messages on the client, there is an error that occurs after the server export moves that is likely related to this issue:

NFS: server 192.168.122.200 error: fileid changed

Version-Release number of selected component (if applicable):

How reproducible:
100%

Steps to Reproduce:
This setup requires floating IPs and multiple NFS servers using pacemaker.
1. Set up the Active/Active HA-NFS use-case as described by the attached presentation slides.
2. On a node outside of the cluster, mount one of the nfs export groups using the floating ip address associated with that group.
3. Place the node exporting the filesystem the client has mounted in standby using 'pcs cluster standby <nodename>'.

Actual results:
The nfs mount on the client is no longer available after the export moves. Performing an ls on the mount directory presents us with the 'stale filehandle' error. The only way to restore the client is to umount and then mount the nfs export again.

Expected results:
The client maintains access to the nfs mount after the export migrates between active NFS servers.