Bug 727575

Summary: overhaul nfsd4 reboot recovery code
Product: Fedora
Component: kernel
Version: rawhide
Status: CLOSED RAWHIDE
Reporter: Jeff Layton <jlayton>
Assignee: Jeff Layton <jlayton>
QA Contact: Fedora Extras Quality Assurance <extras-qa>
CC: bfields, dhowells, gansalmon, itamar, jiali, jonathan, kernel-maint, madhu.chinakonda, rwheeler, sprabhu, steved
Last Closed: 2013-01-03 13:50:22 UTC

Description Jeff Layton 2011-08-02 13:53:24 UTC
knfsd tracks some data in a persistent directory in order to prevent some (rather unlikely) races in state recovery when the server reboots. Currently it does that mostly through the code in fs/nfsd/nfs4recover.c.

That code creates a directory for each saved client id in the v4recoverydir (default set to /var/lib/nfs/v4recovery). It does this, however, by directly modifying the filesystem from within the kernel, using vfs_mkdir, vfs_rmdir, lookup_one_len, etc...

This is considered bad by many folks (though I'm not 100% clear why -- nfsd already does this stuff for "normal" fs access anyway...).

For this bug, I intend to change it to use a more standard upcall mechanism to fetch and store this info. That should also open the door for more elaborate handling of this info for clustering.

I'm working on the initial design here:

    http://wiki.linux-nfs.org/wiki/index.php/Nfsd4_server_recovery

...Bruce started this page, but I've picked it up to see if I can drive this to completion.

Comment 1 Jeff Layton 2011-09-07 18:14:59 UTC
I've started looking at this again with a fresh eye, and I think that
we're probably best off replacing the current scheme with
something based on the cache_detail infrastructure. It has a lot of
what we want already, I think...

I'll plan to re-write up something on the wiki page, but I'm thinking
something along these lines (based on Bruce's original draft design):

We'll use a new procfile cache dir /proc/net/rpc/nfs4.clientid

create_client / allow_client can essentially use the same upcall. We
can set a flag or command value based on whether the grace period is in
effect. If it's not, then the upcall is essentially "create_client". If
it is, then it's "allow_client", and will only return whether the client
is allowed or not.

This means that the daemon will not feed a list of clientids to the
kernel on startup. The kernel will instead upcall for them on an
as-needed basis.

grace_done can also use the same upcall (with a new command field).
The downcall info will just be discarded. We can potentially just have
the kernel call sunrpc_cache_pipe_upcall directly with a dummy
cache_head.

expire_client will also need to be handled a little differently. We'll
need a way for the kernel to negate a cache_head and do an upcall to
userspace to inform the daemon before returning. We may need a new
cache_delete op for this.

We also eventually want userspace to tell the kernel that there are
no more clients to be reclaimed (grace_done in reverse). So, we may need
some sort of userspace-initiated downcall mechanism for that. That can be
added in a later phase though and will probably require another mechanism
unrelated to the cache_detail infrastructure.
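
To illustrate what I mean by using a single upcall with different command
values, something along these lines (just a sketch; all names here are
placeholders, not final):

    /* sketch only -- command names are placeholders */
    enum cln_upcall_cmd {
            Cln_Create,     /* record a new client (grace period not in effect) */
            Cln_Allow,      /* may this client reclaim? (grace period in effect) */
            Cln_GraceDone,  /* grace period over; purge unreclaimed records */
            Cln_Expire,     /* client expired; remove its record */
    };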

Thoughts?

Comment 2 Jeff Layton 2011-09-08 17:40:01 UTC
Some more notes to flesh out the details...

nfsd4_create_clid_dir and nfsd4_remove_clid_dir will need to be replaced
with functions that upcall to manage the clientid db.

nfsd4_recdir_purge_old will need to be replaced with another upcall that
informs the daemon that the grace period has ended and the old records
should be purged.

nfs4_check_open_reclaim will need to be replaced with a version that
upcalls to determine if the client is allowed to reclaim. The results of
that upcall should be cached.

The reason I prefer this approach instead of feeding clientid records to
the kernel at startup time is that it allows for more flexibility, and means
less needs to be "prepared" when a failover occurs.

The cache_detail infrastructure could do this, though it's a bit backward
to use it to track stuff that's primarily generated from kernel space. That
alone might mean upstream resistance to the idea -- Bruce wasn't thrilled
with that approach when we chatted about it yesterday.

We could also use rpc_pipefs and cache the info ourselves. That may be the
best scheme, and is more likely to pass upstream review.

At this point I'm still staring at the existing code, looking over the
RFCs and going over failover scenarios to make sure I understand what we
need out of this.

Comment 3 Jeff Layton 2011-09-16 15:18:25 UTC
I've started work on an implementation that uses rpc_pipefs. I'm creating a "nfsd/" top level dir in rpc_pipefs and will put a "clname" pipe in there for the user to kernel communication.

Bruce had concerns about the "namespace" with this, but it turns out there's already some precedent with the new blocklayout upcall. It just creates a single pipe in one of the top level dirs as well.

The thing I'm trying to determine now is whether and how best to handle multiplexing upcalls. In particular, if two nfsd's want to upcall to recover different clientid's at the same time, should we allow that or serialize them? Serializing them (via a global mutex?) would greatly simplify the code, but it could be a scalability problem down the road.

For that reason, I'm inclined to allow them to be multiplexed, but that's a bit more difficult to manage. When a downcall comes into the pipe, I'll need a way to match it with the upcall and copy the result into the expected buffer.

Comment 4 J. Bruce Fields 2011-09-16 15:56:50 UTC
"it turns out there's already some precedent with the new blocklayout upcall. It just creates a single pipe in one of the top level dirs as well."

OK, great.

"When a downcall comes into the pipe, I'll need a way to match it with the upcall and copy the result into the expected buffer."

Can we match based on struct file?  Then for example userspace could use N threads to handle the upcalls with each thread doing its own open, and as long as they always responded on the same open file they got the upcall on, everything would work.

What about the other users--I guess the new idmapper isn't using a pipe upcall any more, and gssd will eventually move away from that too?

Comment 5 Jeff Layton 2011-09-16 16:44:07 UTC
(In reply to comment #4)

> 
> "When a downcall comes into the pipe, I'll need a way to match it with the
> upcall and copy the result into the expected buffer."
> 
> Can we match based on struct file?  Then for example userspace could use N
> threads to handle the upcalls with each thread doing its own open, and as long
> as they always responded on the same open file they got the upcall on,
> everything would work.
> 

Probably not. The problem is that we don't necessarily know what struct file is going to do the read of a particular message before we queue it. We could try to kludge something like that into the generic pipe struct, but that would put a constraint on the design of the daemon.

What we'll probably have to do is add some sort of "xid" to the upcall, and simply make sure that the downcall sends the same one. When a downcall comes in, we'll copy it to a temp buffer, and look up the matching real buffer for the given xid. Then copy to the new buffer. It sounds like a lot of copies, but it's not that much data and these upcalls shouldn't be very frequent.

> What about the other users--I guess the new idmapper isn't using a pipe upcall
> any more, and gssd will eventually move away from that too?

Yes, idmapper pipe should be going away eventually. I'd also like to see the gssd pipes go away too in favor of something based on keys API, but that's still out a ways.

Comment 6 J. Bruce Fields 2011-09-16 17:57:25 UTC
"The problem is that we don't necessarily know what struct file is
going to do the read of a particular message before we queue it."

That's no problem, I think--the important thing is just that userspace write the response back to the same open that it read the request from.

So in the multithreaded daemon case requests would get queued up and handed out to readers more-or-less at random, which is fine, as long as we have the ability to look at a response, see which struct file it came over, and look at private information in the struct file to identify which request they're responding to.

"I'd also like to see the gssd pipes go away too in favor of something based on keys API, but that's still out a ways."

So, are we making a mistake basing something new on an upcall to a daemon, if all these other things are doing a fork+exec?

Comment 7 Jeff Layton 2011-09-16 18:28:03 UTC
(In reply to comment #6)

> That's no problem, I think--the important thing is just that userspace write
> the response back to the same open that it read the request from.
> 
> So in the multithreaded daemon case requests would get queued up and handed out
> to readers more-or-less at random, which is fine, as long as we have the
> ability to look at a response, see which struct file it came over, and look at
> private information in the struct file to identify which request they're
> responding to.
>

Ahh ok, I see what you mean. We won't know what filp is being used when we
call rpc_pipe_upcall, but we will know that when the ->upcall op is called.
So we won't be able to use rpc_pipe_generic_upcall directly here, but
that's fine.

> "I'd also like to see the gssd pipes go away too in favor of something based on
> keys API, but that's still out a ways."
> 
> So, are we making a mistake basing something new on an upcall to a daemon, if
> all these other things are doing a fork+exec?

We certainly could do this with something fork+exec based. It wouldn't be as
efficient I think, but it is doable. One issue is that this really isn't a
great fit for the keys API. We could do something with the usermode helper
stuff, but that's a lot more cumbersome to deal with. Also, we're going to
be "stuck" with rpc_pipefs for some time anyway for the blocklayout upcall,
and it is reasonably efficient for this sort of thing.

I figure we can have that discussion upstream once I get the initial implementation done. It hopefully won't be too difficult to move it to a different upcall scheme if that's the consensus.

Comment 8 J. Bruce Fields 2011-09-16 20:41:43 UTC
Fair enough, thanks.

Comment 9 Jeff Layton 2011-09-20 18:49:29 UTC
(In reply to comment #6)
>
> So in the multithreaded daemon case requests would get queued up and handed out
> to readers more-or-less at random, which is fine, as long as we have the
> ability to look at a response, see which struct file it came over, and look at
> private information in the struct file to identify which request they're
> responding to.
>

One problem with this idea -- the rpc_pipe infrastructure already uses the filp->private_data to determine whether there's a message that has been queued to the pipe but not read yet. Once the upcall function has been called, it zeroes the private_data pointer:

        res = rpci->ops->upcall(filp, msg, buf, len);
        if (res < 0 || msg->len == msg->copied) {
                filp->private_data = NULL;

...we could maybe change this somehow I suppose, but I worry a bit about monkeying with the core code here.

Comment 10 Jeff Layton 2011-09-20 20:11:13 UTC
Yeah, the more I think about it, I'm not so thrilled with that idea. Here's another issue...

Suppose we have two readers on different filps. One reads half the message, and then another races in and starts reading -- does it get the beginning of a new msg, or the rest of the old one? Of course, we could solve this with locking, but I'd rather not have to worry about that.

I think the best solution is probably to do an XID scheme:

- Keep an unsigned int variable as an xid counter.
- While holding a spinlock, bump the counter.
- Search the global list of in-progress upcalls and make sure we don't have
  one that matches the current counter.
- If so, go to the next value and try again. If not, put that xid in the
  message and put the message on the list.
- Drop the spinlock.

In most cases there won't be contention, so that should generally proceed
without needing more than one pass.
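
In rough code form, the xid allocation might look like this (a sketch only --
the struct and function names here are made up):

    /* sketch: all names here are made up for illustration */
    #include <linux/list.h>
    #include <linux/spinlock.h>
    #include <linux/types.h>

    struct pending_upcall {
            struct list_head        list;
            u32                     xid;
            /* ...message buffer, completion, etc... */
    };

    static DEFINE_SPINLOCK(upcall_lock);
    static LIST_HEAD(upcall_list);
    static u32 upcall_xid;

    static bool xid_in_use(u32 xid)
    {
            struct pending_upcall *p;

            list_for_each_entry(p, &upcall_list, list)
                    if (p->xid == xid)
                            return true;
            return false;
    }

    static void queue_cln_upcall(struct pending_upcall *new)
    {
            spin_lock(&upcall_lock);
            do {
                    new->xid = upcall_xid++;
            } while (xid_in_use(new->xid));
            list_add_tail(&new->list, &upcall_list);
            spin_unlock(&upcall_lock);
    }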

Comment 11 J. Bruce Fields 2011-09-20 22:59:14 UTC
"Suppose we have two readers on different filps. One reads half the message, and
another races in and starts reading. Then, another races in and starts reading
-- does he get the beginning of a new msg, or the rest of the old one? Of
course, we could solve this with locking, but I'd rather not have to worry
about that."

I don't think locking's needed: I think you move the message out of the inode's queue into some structure you can reach from the filp's private data on first read.  From that point on only that filp sees that message.

But I'll trust your judgement as to which approach looks simpler at this point....

Comment 12 Jeff Layton 2011-10-04 13:09:47 UTC
I've made some progress on this code in the last week or so.

I have a patchset that adds a new "nfsd" dir at the top level of rpc_pipefs. It then creates a "clnamed" pipe in that directory. I've also added the code to do an upcall using that pipe to create a new client record. I decided to use an "xid" based scheme for matching upcalls to downcalls. I didn't see an easy way to do it based on the filp if I wanted to use libevent (and we do want to use that here, I think).

I'm currently working on the userspace daemon that will sit on the other end of the pipe. That daemon will use libevent to service the upcalls. I'm still deciding what to do to track the info on stable storage.

My current thinking is to create a new "clnamed" directory or something that defaults to being in the NFS_STATEDIR. That directory will contain another set of directories named after the server-side IP addresses (each upcall sends that info as a matter of course).

Each of those directories will contain files named with the hash of the actual client-provided ID. Each file will contain one or more records with the actual client name that the client sent. That way if there are hash collisions this should still work.

I plan to have the daemon use very careful POSIX locking when working with the "DB". To "clusterize" this, we can just put this directory on shared storage. Multiple accesses by multiple daemons should be no real problem.
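
For clarity, the "careful POSIX locking" I have in mind is just plain fcntl
whole-file locking around each update, roughly like this (sketch only, error
handling trimmed):

    #include <fcntl.h>
    #include <unistd.h>

    /* sketch: take an exclusive lock over the whole file before updating it */
    static int lock_record_file(int fd)
    {
            struct flock fl = {
                    .l_type   = F_WRLCK,    /* exclusive write lock */
                    .l_whence = SEEK_SET,
                    .l_start  = 0,
                    .l_len    = 0,          /* len 0 == lock the whole file */
            };

            /* F_SETLKW: block until the lock is granted */
            return fcntl(fd, F_SETLKW, &fl);
    }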

An alternate scheme would be to use something like Berkeley DB or sqlite, but the above scheme seems relatively simple, and should work well on pretty much any clustered fs that does locking.

Comment 13 Jeff Layton 2011-10-05 14:51:54 UTC
Ok, I have a proof-of-concept approach started here:

http://git.samba.org/?p=jlayton/linux.git;a=shortlog;h=refs/heads/nfsd-clname

...and...

http://git.linux-nfs.org/?p=jlayton/nfs-utils.git;a=shortlog;h=refs/heads/clnamed

...this uses an rpc_pipefs pipe to do the upcall to userspace. That upcall is scooped up by the daemon which uses libevent to handle the calls. The upcall is a binary format, but the struct is packed and uses fixed-length data types.
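
For illustration, the message looks something like this (a simplified sketch
only -- the real struct in the posted patches may differ in names and sizes):

    #include <stdint.h>

    #define CLNAME_MAXLEN 1024  /* illustrative limit, not the real one */

    /* sketch of a packed, fixed-width upcall/downcall record */
    struct clname_msg_sketch {
            uint8_t  cm_vers;               /* message format version */
            uint8_t  cm_cmd;                /* create/allow/expire/grace_done/... */
            int16_t  cm_status;             /* result, filled in by the downcall */
            uint32_t cm_xid;                /* matches a downcall to its upcall */
            uint16_t cm_len;                /* bytes of cm_id actually used */
            char     cm_id[CLNAME_MAXLEN];  /* opaque client identifier */
    } __attribute__((packed));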

So far, the code doesn't actually do anything besides just respond when an upcall comes in. It also only understands one type of upcall currently (the create one). The upcall/downcall communication does seem to work though.

I probably also need to add a way to track these things so we don't need to upcall on every open. Maybe a new flag in the nfs4_client struct or something?

I've also got the new code running in parallel with the old recoverydir stuff at the moment -- eventually I'll have to ifdef out the old code.

As a SWAG, I'd estimate another month or two of solid work to get this going if the current approach is acceptable.

As a transition mechanism, I think I'll need to wrap the existing v4recoverydir code in a Kconfig option, and add a new one for the new code, similar to how the transition to the new idmapper is being handled.

Bruce, does this look like a reasonable approach? Anything you see here that's a show stopper?

Comment 14 J. Bruce Fields 2011-10-05 20:59:17 UTC
Thanks.  OK, so upcall and downcall both use struct clname_msg?

I'd like to talk this over with Trond tomorrow.  He's not been a fan of binary formats for these upcalls in the past.  I suspect his real concerns were for expandability and compatibility.  Command and version number fields, a packed struct, and defined-size integer types should address those issues, I think....

Using an upcall for the on-boot Cln_allow seems inefficient as people with lots of clients will get all those upcalls at once.  But I suppose it's not a showstopper.  Ensuring that it will be easy to run multiple clnamed processes would help us be prepared if that turns out to be an issue.

A Kconfig option is kind of a big hammer as a transition mechanism--ideally we'd adapt at runtime to what's available in userspace--but that's probably not worth it if it's more complicated.

There are a number of reasons why the list of commands and arguments might change:
- We may also want to see if there's anything simple we can do here to lessen the annoyance of the v4 grace period.
- If a namespace/container-based approach (which I'm currently optimistic about) works well, we probably won't need the server IP address.
- I've sometimes thought it might be useful to track state per filesystem (for example, to allow skipping the grace period on a filesystem that was previously unused; or to take advantage of cluster filesystems that might have some more knowledge about per-filesystem lock state?).  In which case we might want to add a filesystem argument.

But if we do want to do any of that, it may be easier once we have something running.

Comment 15 Jeff Layton 2011-10-06 00:55:15 UTC
(In reply to comment #14)
> Thanks.  OK, so upcall and downcall both use struct clname_msg?
> 

Currently, yes.

> I'd like to talk this over with Trond tomorrow.  He's not been a fan of binary
> formats for these upcalls in the past.  I suspect his real concerns were for
> expandability and compatibility.  Command and version number fields, a packed
> struct, and defined-size integer types should address those issues, I think....
> 

Sure. If doing this with text or xdr or something is more palatable, it shouldn't be too hard to adapt.

I also had another thought earlier today that I'll throw out here...

Instead of doing this, we could consider extending statd to track v4 recovery info. We'd have to extend the statd protocol with new procs, but it would mean that people wouldn't need 2 daemons. statd also already has a lot of code for tracking state on-disk.

What we'd need to do is move the statd kernel code to a separate module, and then add the ability for it to handle the new procedures...

> Using an upcall for the on-boot Cln_allow seems inefficient as people with lots
> of clients will get all those upcalls at once.  But I suppose it's not a
> showstopper.  Ensuring that it will be easy to run multiple clnamed processes
> would help us be prepared if that turns out to be an issue.
> 

If we suppose 10000 clients with state, that's 10000 upcalls. That doesn't seem too bad. They wouldn't go all at once though of course. We'll be effectively constrained to the number of nfsd's for simultaneous upcalls.

> A Kconfig option is kind of a big hammer as a transition mechanism--ideally
> we'd adapt at runtime to what's available in userspace--but that's probably not
> worth it if it's more complicated.
> 

Runtime adaptation might be doable. Eventually though, we'll want to deprecate the old code, so adding Kconfig options around this might make that simpler in the future.

Still, runtime adaptation might be tricky -- does the lack of a running daemon mean that I should wait around to see if one shows up, or just go with the "old" scheme?

> There are a number of reasons why the list of commands and arguments might change:

Adding commands shouldn't require revving the struct version (assuming we keep the binary upcall/downcall format in the first place), unless we need new fields.

> - We may also want to see if there's anything simple we can do here to lessen
> the annoyance of the v4 grace period.

I did add a stub command with that in mind:

Cln_NrToReclaim -- basically, we can ask userspace how many clients need to reclaim at boot time. Then, count down as we reclaim them. At 0, we can lift the grace period. Not particularly helpful with v4.0 since we don't know when the client is done reclaiming, but with 4.1 it could be helpful.

> - If a namespace/container-based approach (which I'm currently optimistic
> about) works well, we probably won't need the server IP address.

I'm inclined to send it in the upcall anyway. If it turns out to be extraneous, then no real harm done. If we decide we need it later, then it may save us from having to jump through hoops to add it.

> - I've sometimes thought it might be useful to track state per filesystem (for
> example, to allow skipping the grace period on a filesystem that was previously
> unused; or to take advantage of cluster filesystems that might have some more
> knowledge about per-filesystem lock state?).  In which case we might want to
> add a filesystem argument.
> 

Ok, something to keep in mind.

> But if we do want to do any of that, it may be easier once we have something
> running.

Yeah, that's sort of what I'm thinking too. A lot of these things will become clearer after the first pass at this.

Comment 16 J. Bruce Fields 2011-10-06 01:24:44 UTC
> Instead of doing this, we could consider extending statd to track v4
> recovery info.

I remember discussing something like that with Chuck at some point.  His take was that the statd code is rather old and anybody that would understand why it was done a certain way is long gone, and it might be rather unpleasant to work with.

But I haven't really tried to check whether that's true; it would probably be worth a day to take a look around the statd code and see what the situation is.

I think our initial goal is to get something basic that we can play with working quickly--so we don't want to spend too much time refactoring old code.

I don't know.  I do like the idea of cutting down on the number of daemons.  And I like the idea of putting into one daemon all the logic to handle reboot recovery across versions.

Another random idea might be to write the new daemon first and then see if it's easy to extend it to handle statd's responsibilities too.

Perhaps that's too much NIH.

> If we suppose 10000 clients with state, that's 10000 upcalls. That doesn't
> seem too bad. They wouldn't go all at once though of course.

Completing 10000 within the grace period would require doing 100-some per second.  I suppose they only require reads, so that shouldn't be so hard, OK.

I was also wondering if there'd be an explosion if we kept per-filesystem state.  (Say you had 1000 clients and 100 filesystems...).  But even there every client isn't normally going to have active state on each filesystem.  And even *then* if it was a problem, I suppose there are ways we could extend this to handle bulk data.  OK, I think I'll stop worrying about this.

Comment 17 Jeff Layton 2011-10-17 18:47:16 UTC
Ok, I (again) have a proof-of-concept daemon that can talk to the kernel on one call. Now I need to decide how best to store the info on-disk.

My initial thought was to use something like BDB or sqlite, but I remember the brouhaha over that with statd. I think though that Trond's objections were more about legacy users, and we don't really need to worry about that here, IMO...

The other option is to do something like the design described in the comment block below. The idea there is that the top-level directory can live on shared storage (like a GFS2 dir or even on NFS itself). The daemon can run on multiple machines, and as long as we get the locking right, all should be ok.

I started hashing out the code, but it's not as simple as it sounds, since you need to do things like fsync both the file and the containing dir, and handle locking to ensure that multiple machines don't clobber the data.

 * This file contains the code to manage a database for the clnamed
 * upcall daemon. The storage format is as follows:
 *
 * The top level directory is defined as CLNAMED_DIR.
 *
 * Inside that directory are other directories that represent the local
 * server address with which the client state was established. Since
 * clients establish state with a particular address, we track client
 * state on a per-server-address basis.
 *
 * Each server address directory contains a group of files named like:
 *
 * client_name_hash:index
 *
 * This is a hash of the client name using md5. The index is there to take
 * care of hash collisions.
 *
 * The actual client_name as provided by the kernel is stored within the file.

BDB or sqlite might be easier. Any thoughts or preferences?

Comment 18 J. Bruce Fields 2011-10-17 19:38:40 UTC
I'm not sure what to recommend.  Possible requirements:

- We need to be able to read the boot recovery information back from different nodes, and after reboot, so whatever format we use has to be stable across versions (in case the library is updated), and across nodes.
- It should be easy for users and developers to examine the state when debugging.
- I'm not sure if performance will turn out to be a limit, but if so, I think the factors would be: time required to commit one of these records to stable storage (which will determine how fast we can respond to a client's first stateful operation); and time to look up clients as they show up on reboot.
- We may want to think about how we'd extend the format if we needed to later.  (E.g. if later we wanted to track one more piece of information about each client, how hard will it be to upgrade existing state?)

Comment 19 Jeff Layton 2011-10-17 19:51:33 UTC
(In reply to comment #18)
> I'm not sure what to recommend.  Possible requirements:
> 
> - We need to be able to read the boot recovery information back from different
> nodes, and after reboot, so whatever format we use has to be stable across
> versions (in case the library is updated), and across nodes.

That shouldn't be a problem with sqlite. They haven't changed the on-disk format since 2006. Hopefully if they ever do, then they'll bump the major version number...

> - It should be easy for users and developers to examine the state when
> debugging.

sqlite comes with a command-line tool (sqlite3) that you can use to run queries against the db. There may also be GUI tools for it (not sure).

> - I'm not sure if performance will turn out to be a limit, but if so, I think
> the factors would be: time required to commit one of these records to stable
> storage (which will determine how fast we can respond to a client's first
> stateful operation); and time to look up clients as they show up on reboot.

I think this is where we'll be better off with a db engine rather than doing
this with a custom disk format. A lot less locking and opening of files that way.

> - We may want to think about how we'd extend the format if we needed to later. 
> (E.g. if later we wanted to track one more piece of information about each
> client, how hard will it be to upgrade existing state?)

That will be a problem no matter what format we use. We'll need to write code
to migrate old to new regardless. If we ever need to do this we can provide a conversion tool or process to convert the data.

In practice, what I would probably do is create a "version" table in the db and have the daemon exit at startup if it can't work within the new format.

We wouldn't want to do this automagically within the daemon since people could run this in a cluster and they want to be certain they don't screw up other hosts that might be working in the db.
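
A rough sketch of the sort of startup check I mean, using the sqlite3 C API
(the table and column names here are placeholders):

    #include <stdio.h>
    #include <sqlite3.h>

    #define EXPECTED_SCHEMA_VERSION 1   /* placeholder value */

    /* sketch: bail out if the on-disk schema isn't one we understand */
    static int check_schema_version(sqlite3 *db)
    {
            sqlite3_stmt *stmt;
            int vers;

            if (sqlite3_prepare_v2(db, "SELECT value FROM version;", -1,
                                   &stmt, NULL) != SQLITE_OK)
                    return -1;

            if (sqlite3_step(stmt) != SQLITE_ROW) {
                    sqlite3_finalize(stmt);
                    return -1;
            }

            vers = sqlite3_column_int(stmt, 0);
            sqlite3_finalize(stmt);

            if (vers != EXPECTED_SCHEMA_VERSION) {
                    fprintf(stderr, "unsupported schema version %d\n", vers);
                    return -1;
            }
            return 0;
    }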

Comment 20 Jeff Layton 2011-11-09 14:23:50 UTC
I've made some progress on this over the last week or so. I now have a minimally functional daemon that will:

- start up and check the schema version
- create and attach a database for a server address on a "create" upcall
- insert the client's ID blob into the database
- do a downcall to the kernel with the result

It's still quite rough. To do list:

- sanitize the error handling and reporting. Make sure we're only sending
  POSIX error codes down to the kernel

- add other upcall handlers for other commands. Now that the basic framework
  is in place this should go rather quickly

- clean up the kernel code, and set it up to conditionally compile under a
  new Kconfig option

- more autoconf work to ensure that the dependencies are correct

Once I have something a bit more functional, I'll plan to post a draft
version upstream as an RFC.

Comment 21 Jeff Layton 2011-11-16 12:50:36 UTC
A little more work in the last week or so:

- I added a "remove" upcall, that gets called when the client expires (basically the upcall equivalent of nfsd4_remove_clid_dir).

- I added a timestamp to the table so we know at what point it was inserted or
updated

I also have it set up so that we attach a new database file for each server address (rq_daddr).

Now though, I'm a bit stuck on how to deal with the end of the grace period...

When the grace period ends, we'll want to do the equivalent of purge_old(). That is, we'll want to remove any unreclaimed client records. The problem though is how do we know which ones to remove in a clustered setup?

We can't simply walk all of the address-specific tables and remove any client records that don't have timestamps later than the most recent reboot. It's possible that some of those addresses are "owned" by other hosts. How do we tell what addresses this host now owns?

I suppose we can do an end_grace upcall for each address that the host currently has, but we'll need some way to enumerate all configured addresses in the kernel.

Comment 22 Jeff Layton 2011-11-16 13:45:01 UTC
I think we may need to step back and consider what the grace period means in our proposed configuration...

We're proposing to essentially eliminate the grace period on failovers in an active/active configuration. The idea being that DLM (or whatever) will have the smarts to know that a particular lock is recoverable and in need of recovery.

We do still however need to deal with:

- single server case with non-recoverable locks

- active/passive clusters with non-recoverable locks (similar to single-server case)

- reboot of an entire active/active cluster

All of these cases will require that the server(s) enforce the grace period on startup.

Does it make sense to somehow consider the grace period to be a cluster-wide thing? In an active/active clustered configuration, we lift the grace period simultaneously for all the hosts in the cluster?

If that's the case, then we could designate one "master" node that cleans out all of the unrecovered addresses from all of the tables, and then (somehow) tells the rest of the servers to lift the grace period.

Comment 23 Jeff Layton 2011-12-02 13:44:38 UTC
I chatted with Bruce about this on IRC a few weeks ago. My notes were a bit spotty, but his idea is essentially to move grace period handling into the filesystem. IOW, nfsd would query the lower fs to see whether it was currently in grace or not, probably via some sort of export op.

Most filesystems could use a common simple grace period timer that works much like it does today. When nfsd starts, we'd start the timer, etc...

Clustered filesystems could do something more elaborate...

For those, I think we'd need a new daemon and "switch" (maybe an ioctl?) for each clustered filesystem to toggle the grace period.

The cluster resource manager would then fire up a new daemon when all of the cluster addresses and nfsds are up. That daemon would just sleep on a timer. When the timer pops, it would toggle the grace period switches on all of the clustered filesystems to take them out of grace.

We wouldn't even necessarily need to involve DLM in this at all. The cluster could start this new daemon on all of the nodes and they could communicate with each other via something like corosync, or the new clstated daemon I'm working on could even be outfitted to do this.

Bruce, any thoughts?

Comment 24 J. Bruce Fields 2011-12-02 19:11:16 UTC
(In reply to comment #23)
> I chatted with Bruce about this on IRC a few weeks ago. My notes were a bit
> spotty, but his idea is essentially to move grace period handling into the
> filesystem. IOW, nfsd would query the lower fs to see whether it was currently
> in grace or not, probably via some sort of export op.

Actually my thought was that nfsd should tell the filesystem when the grace period starts and stops, and then the filesystem should be responsible for enforcing the grace period (by returning appropriate errors to lock calls).

The filesystem might then fail the lock even when the server doesn't think it's in grace period (because the filesystem happens to know that nfsd on another cluster node still is in the grace period).

Or the filesystem might allow the lock even when the server believes we're in the grace period (because the filesystem has better information--e.g. its backend lock manager knows that nobody is waiting to recover a lock on the file in question).

> Most filesystems could use a common simple grace period timer that works much
> like it does today. When nfsd starts, we'd start the timer, etc...

I'd rather the filesystem itself not be made responsible for timing the grace period, because nfsd might have more information:

- nfsd may know that the only clients to recover are 4.1 clients, and that they have all issued RECLAIM_COMPLETE, in which case the grace period can end earlier.
- nfsd might recognize that a client has arrived just as the grace period ended, and consider extending the grace period by a few more seconds to give it a chance to recover.  (OK, it's not likely we'd actually implement that.)

> Clustered filesystems could do something more elaborate...
> 
> For those, I think we'd need a new daemon and "switch" (maybe an ioctl?) for
> each clustered filesystem to toggle the grace period.

So I would expect the flow of information here to be more like:

- server is started; calls s_export_ops->start_grace() to notify filesystem we're starting grace period.
- If fs is an ordinary local filesystem that gets handled by common vfs code that just says "ok somebody's in a grace period, fail any non-reclaim locks from now on."
- If fs is a cluster filesystem it does an upcall to cluster infrastructure (or dlm call of some kind, who knows) to say "server X is starting its grace period".
- at end of grace period nfsd calls s_export_ops->end_grace().
Etc.

?
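
In other words, something roughly like this (purely illustrative; none of this
exists yet, and the names are made up):

    /* illustrative sketch of possible new export op hooks */
    struct super_block;

    struct export_operations_sketch {
            /* ...existing export ops... */

            /*
             * nfsd tells the fs when its grace period starts and ends;
             * the fs then fails non-reclaim lock (and open) requests as
             * long as anyone it knows about is still in grace.
             */
            void (*start_grace)(struct super_block *sb);
            void (*end_grace)(struct super_block *sb);
    };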

Comment 25 Jeff Layton 2011-12-02 20:58:16 UTC
(In reply to comment #24)

> - server is started; calls s_export_ops->start_grace() to notify filesystem
> we're starting grace period.
> - If fs is an ordinary local filesystem that gets handled by common vfs code
> that just says "ok somebody's in a grace period, fail any non-reclaim locks
> from now on."
> - If fs is a cluster filesystem it does an upcall to cluster infrastructure (or
> dlm call of some kind, who knows) to say "server X is starting its grace
> period".
> - at end of grace period nfsd calls s_export_ops->end_grace().
> Etc.

Ok, that helps. Some questions though...

First, we have to do this on more than lock calls, right? Opens as well?

How does the server know what filesystems to call start_grace on? Does it call it on every fs? What about fs's that get mounted after nfsd starts up?

Maybe we could modify that a bit. Instead of calling the fs with a start/end grace call, we could present an interface where the fs could query nfsd and ask whether it was currently in grace? Or maybe nfsd could pass that info down to the vfs in the lock/open call?

Comment 26 Jeff Layton 2011-12-14 14:02:27 UTC
I've gone ahead and posted an initial set of patches for this:

    http://marc.info/?l=linux-nfs&m=132387099603720&w=2
    http://marc.info/?l=linux-nfs&m=132387113603774&w=2

The code basically works in its current state, but it needs a bit of cleanup
before it should go in. At this point, I'm looking for feedback on the overall
design before I spend time doing that cleanup.

Comment 27 Jeff Layton 2011-12-23 12:14:39 UTC
I'm on the 3rd iteration of the patches and have declared it a formal submission. With luck, the kernel pieces should make 3.3.

Comment 28 Jeff Layton 2012-01-24 15:07:48 UTC
I've updated the patchset with a fourth iteration, and have requested that it
be merged in 3.4. With this, we should be pretty well settled on what's needed
in the clustered case as well.

Comment 29 Jeff Layton 2012-02-13 15:17:59 UTC
6th version of the kernel patchset sent to linux-nfs today. The changes in
that one shouldn't require any changes in the userspace code. I'm hoping to
get the kernel pieces in for 3.4.

Comment 30 Jeff Layton 2012-02-27 14:52:46 UTC
Ugh...just noticed the rpc_pipefs namespace patches are in Trond's devel tree
and are probably going to go in for 3.4. I'm assuming that I'll need to respin
this set on top of those, once I figure out what I need to do for them...

Comment 31 J. Bruce Fields 2012-02-27 15:24:51 UTC
OK, we'll need to negotiate something with him, but that's doable.  Simplest might be to get him to move them into a branch that he considers stable (so they won't be rebased or whatever), and then we can pull that into the server-side for-3.4 branch and base your work on top of that.

For now best might be to do something like that privately (so merge his development branch with mine, then work on top of that) and the resulting patches will likely be something that will apply.

But we should ask Trond.

Apologies for being slow to respond to the upstream patches.  I'm setting up some reboot recovery testing, since we seem to have let some bugs creep in there.  I have some pynfs tests that I used to run regularly, so it's just a matter of brushing off those scripts and then adding more tests if they don't cover the new cases you found.

If it's possible while you're respinning the patches, it would be nice to have any bugfixes pulled out as separate patches at the start of the series.

Comment 32 Jeff Layton 2012-02-27 15:38:44 UTC
Yeah, shouldn't be too hard to refactor the code, but the namespace stuff has
certainly made the pipe instantiation and removal a bit more complex. It's
also in flux to some degree at the moment, so who knows when it'll actually
go in? I'll ping Trond to see what he thinks...

Comment 33 Jeff Layton 2012-03-29 15:21:59 UTC
Bruce has merged the latest iteration of the kernel piece for this. Now, just
waiting on Steve D to merge the userspace portion in nfs-utils.

Comment 34 Jeff Layton 2012-10-15 10:22:05 UTC
The userspace portion was merged quite some time ago, but now we're looking at turning this into a usermodehelper upcall instead. Stay tuned...

Comment 35 Jeff Layton 2013-01-03 13:50:22 UTC
The usermodehelper approach has been merged into the kernel and nfs-utils. At this point, I'm going to declare this "done" and close the bug with a resolution of RAWHIDE since this is now in F19.