Bug 1687360
Summary: RHEL7.4 NFS4.1 client and server repeated SEQUENCE / TEST_STATEIDs with SEQUENCE reply has SEQ4_STATUS_RECALLABLE_STATE_REVOKED set - even after NFS server patch to properly return NFS4ERR_DELEG_REVOKED or NFS4ERR_BAD_STATEID for revoked delegations

Product: Red Hat Enterprise Linux 7
Component: kernel
Kernel sub component: NFS
Version: 7.6
Status: CLOSED INSUFFICIENT_DATA
Severity: urgent
Priority: high
Target Milestone: rc
Hardware: Unspecified
OS: Unspecified
Reporter: Dave Wysochanski <dwysocha>
Assignee: Scott Mayhew <smayhew>
QA Contact: Zhi Li <yieli>
CC: agaikwad, awshaikh, baumanmo, bcodding, Bert.Deknuydt, bfields, cww, dweglinski, dwysocha, fsorenso, james-p, jiyin, jwedgeco, plambri, rbergant, revers, rhandlin, rik.theys, rmonther, sgardner, smayhew, steved, swhiteho, tbecker, xzhou, yieli, yoyang
Clone Of: 1552203
Last Closed: 2020-01-23 23:30:00 UTC
Bug Depends On: 1552203
Bug Blocks: 1477664, 1594286, 1711360
Description — Dave Wysochanski, 2019-03-11 11:16:21 UTC
Created attachment 1542889 [details]
tcpdump script to be used with nfs client systemtap script
Created attachment 1542890 [details]
nfs client systemtap script
Created attachment 1542891 [details]
nfs server simulated error injection systemtap script
Unfortunately this probably will not make RHEL7.7 unless there is a breakthrough this week where we find missing patches that can be backported. At this point I don't think we even know whether this is an upstream issue or a RHEL-only issue.

Hey Bruce, I may have found a race that could lead to this problem.

Background: the SEQ4_STATUS_RECALLABLE_STATE_REVOKED bit gets set on a SEQUENCE reply only in the following code, when the cl_revoked list is not empty:

~~~
3023 __be32
3024 nfsd4_sequence(struct svc_rqst *rqstp,
3025                struct nfsd4_compound_state *cstate,
3026                struct nfsd4_sequence *seq)
3027 {
...
3121 out:
3122         switch (clp->cl_cb_state) {
3123         case NFSD4_CB_DOWN:
3124                 seq->status_flags = SEQ4_STATUS_CB_PATH_DOWN;
3125                 break;
3126         case NFSD4_CB_FAULT:
3127                 seq->status_flags = SEQ4_STATUS_BACKCHANNEL_FAULT;
3128                 break;
3129         default:
3130                 seq->status_flags = 0;
3131         }
3132         if (!list_empty(&clp->cl_revoked))
3133                 seq->status_flags |= SEQ4_STATUS_RECALLABLE_STATE_REVOKED;
3134 out_no_session:
3135         if (conn)
3136                 free_conn(conn);
3137         spin_unlock(&nn->client_lock);
3138         return status;
3139 out_put_session:
3140         nfsd4_put_session_locked(session);
3141         goto out_no_session;
3142 }
~~~

There's one place we add to the list:

~~~
fs/nfsd/nfs4state.c
 989 static void revoke_delegation(struct nfs4_delegation *dp)
 990 {
 991         struct nfs4_client *clp = dp->dl_stid.sc_client;
 992
 993         WARN_ON(!list_empty(&dp->dl_recall_lru));
 994
 995         put_clnt_odstate(dp->dl_clnt_odstate);
 996         nfs4_put_deleg_lease(dp->dl_stid.sc_file);
 997
 998         if (clp->cl_minorversion == 0)
 999                 nfs4_put_stid(&dp->dl_stid);
1000         else {
1001                 dp->dl_stid.sc_type = NFS4_REVOKED_DELEG_STID;
1002                 spin_lock(&clp->cl_lock);
1003                 list_add(&dp->dl_recall_lru, &clp->cl_revoked);
1004                 spin_unlock(&clp->cl_lock);
1005         }
1006 }
~~~

So the question is: could we have added to this list and forgotten to delete something we should have deleted?
Well, as it turns out, it looks like a race could occur between the above code and the handling of a FREE_STATEID. In handling FREE_STATEID, if sc_type == NFS4_REVOKED_DELEG_STID, we call list_del_init(&dp->dl_recall_lru):

~~~
5082 __be32
5083 nfsd4_free_stateid(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
5084                    struct nfsd4_free_stateid *free_stateid)
5085 {
5086         stateid_t *stateid = &free_stateid->fr_stateid;
5087         struct nfs4_stid *s;
5088         struct nfs4_delegation *dp;
5089         struct nfs4_client *cl = cstate->session->se_client;
5090         __be32 ret = nfserr_bad_stateid;
5091
5092         spin_lock(&cl->cl_lock);
5093         s = find_stateid_locked(cl, stateid);
5094         if (!s)
5095                 goto out_unlock;
5096         switch (s->sc_type) {
5097         case NFS4_DELEG_STID:
5098                 ret = nfserr_locks_held;
5099                 break;
5100         case NFS4_OPEN_STID:
5101                 ret = check_stateid_generation(stateid, &s->sc_stateid, 1);
5102                 if (ret)
5103                         break;
5104                 ret = nfserr_locks_held;
5105                 break;
5106         case NFS4_LOCK_STID:
5107                 atomic_inc(&s->sc_count);
5108                 spin_unlock(&cl->cl_lock);
5109                 ret = nfsd4_free_lock_stateid(stateid, s);
5110                 goto out;
5111         case NFS4_REVOKED_DELEG_STID:
5112                 dp = delegstateid(s);
5113                 list_del_init(&dp->dl_recall_lru);
5114                 spin_unlock(&cl->cl_lock);
5115                 nfs4_put_stid(s);
5116                 ret = nfs_ok;
5117                 goto out;
5118         /* Default falls through and returns nfserr_bad_stateid */
5119         }
5120 out_unlock:
5121         spin_unlock(&cl->cl_lock);
5122 out:
5123         return ret;
5124 }
~~~

What if one thread inside revoke_delegation() races with a second thread inside nfsd4_free_stateid()?
Let's say we have:

thread1:

~~~
 989 static void revoke_delegation(struct nfs4_delegation *dp)
 990 {
 991         struct nfs4_client *clp = dp->dl_stid.sc_client;
 992
 993         WARN_ON(!list_empty(&dp->dl_recall_lru));
 994
 995         put_clnt_odstate(dp->dl_clnt_odstate);
 996         nfs4_put_deleg_lease(dp->dl_stid.sc_file);
 997
 998         if (clp->cl_minorversion == 0)
 999                 nfs4_put_stid(&dp->dl_stid);
1000         else {
1001                 dp->dl_stid.sc_type = NFS4_REVOKED_DELEG_STID;
~~~

thread2:

~~~
5092         spin_lock(&cl->cl_lock);
5093         s = find_stateid_locked(cl, stateid);
5094         if (!s)
5095                 goto out_unlock;
5096         switch (s->sc_type) {
5097         case NFS4_DELEG_STID:
5098                 ret = nfserr_locks_held;
5099                 break;
5100         case NFS4_OPEN_STID:
5101                 ret = check_stateid_generation(stateid, &s->sc_stateid, 1);
5102                 if (ret)
5103                         break;
5104                 ret = nfserr_locks_held;
5105                 break;
5106         case NFS4_LOCK_STID:
5107                 atomic_inc(&s->sc_count);
5108                 spin_unlock(&cl->cl_lock);
5109                 ret = nfsd4_free_lock_stateid(stateid, s);
5110                 goto out;
5111         case NFS4_REVOKED_DELEG_STID:
5112                 dp = delegstateid(s);
5113                 list_del_init(&dp->dl_recall_lru);
5114                 spin_unlock(&cl->cl_lock);
5115                 nfs4_put_stid(s);
5116                 ret = nfs_ok;
5117                 goto out;
5118         /* Default falls through and returns nfserr_bad_stateid */
5119         }
5120 out_unlock:
5121         spin_unlock(&cl->cl_lock);
~~~

thread1:

~~~
1002                 spin_lock(&clp->cl_lock);
1003                 list_add(&dp->dl_recall_lru, &clp->cl_revoked);
1004                 spin_unlock(&clp->cl_lock);
1005         }
1006 }
~~~

Thread1 will set sc_type = NFS4_REVOKED_DELEG_STID but lose the race on cl->cl_lock. Thread2 will win the race and call list_del_init(&dp->dl_recall_lru), but this won't remove anything from cl_revoked since Thread1 hasn't added the entry yet. Thread1 will then add the entry to the cl_revoked list, and it won't be removed until the client is destroyed. I think this race also exists in upstream code, as it looks similar.
A naive fix would be to move the spinlock a little higher in revoke_delegation() to cover the setting of sc_type:

~~~
[dwysocha@dwysocha rhel7-kernel]$ git diff
diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
index 13023d6..7b97ed3 100644
--- a/fs/nfsd/nfs4state.c
+++ b/fs/nfsd/nfs4state.c
@@ -998,8 +998,8 @@ static void revoke_delegation(struct nfs4_delegation *dp)
         if (clp->cl_minorversion == 0)
                 nfs4_put_stid(&dp->dl_stid);
         else {
-                dp->dl_stid.sc_type = NFS4_REVOKED_DELEG_STID;
                 spin_lock(&clp->cl_lock);
+                dp->dl_stid.sc_type = NFS4_REVOKED_DELEG_STID;
                 list_add(&dp->dl_recall_lru, &clp->cl_revoked);
                 spin_unlock(&clp->cl_lock);
         }
~~~

I haven't fully worked out what Thread1 actually would be as far as a code path, so I don't have a reproducer yet. But it looks like this race could occur and could lead to what we're seeing: the NFS server has cl_revoked non-empty, but the NFS client must have already forgotten about this delegation. Otherwise the NFS client would send TEST_STATEID on it and would receive NFS4ERR_DELEG_REVOKED due to the patch for https://bugzilla.redhat.com/show_bug.cgi?id=1552203

One final related thought before I look for a reproducer. Assume we have the patch from the previous comment, and thread2 wins the race and sees sc_type == NFS4_DELEG_STID while handling FREE_STATEID: it will return NFS4ERR_LOCKS_HELD to the NFS client. I don't see any handling of that error in the NFS client, either in rhel7 or upstream, which seems odd.

We don't have a reproducer, and a possible patch would need to go upstream first, so this won't make 7.7.

That's interesting. In https://tools.ietf.org/html/rfc5661#section-18.38 it says "The FREE_STATEID operation is used to free a stateid that no longer has any associated locks (including opens, byte-range locks, delegations, and layouts)."
So if the stateid still represents a real delegation, the client should be sending DELEGRETURN; it should only be sending FREE_STATEID if it knows the delegation has already been revoked--but it can't know that until after a server thread has added the stateid to a revoked list and returned the SEQUENCE reply.

But now that I say that--that STATE_REVOKED bit only tells the client that *something* was revoked; it doesn't necessarily know what. So the client may just send FREE_STATEIDs for all of the state it knows about, ignoring any that return any kind of error?

So the situation here may be that we're revoking a bunch of delegations at once. The cl_revoked list is already non-empty and the client has already started sending out random FREE_STATEIDs, and one of them happens to arrive just as we're revoking another delegation.

That seems like a small window for a race, but maybe if you're doing a ton of revocations it becomes possible?

Your patch would make sense anyway, good find.

(In reply to J. Bruce Fields from comment #13)
> That's interesting.
> In https://tools.ietf.org/html/rfc5661#section-18.38 it says "The
> FREE_STATEID operation is used to free a stateid that no longer has any
> associated locks (including opens, byte-range locks, delegations, and
> layouts)."
>
> So if the stateid still represents a real delegation, the client should be
> sending DELEGRETURN; it should only be sending FREE_STATEID if it knows the
> delegation has already been revoked--but it can't know that till after a
> server thread has added the stateid to a revoked list and returned the
> SEQUENCE reply.

Hmm.. I was looking today at the NFS client code and found that it looks like it could send FREE_STATEID following a TEST_STATEID that completes with NFS4ERR_EXPIRED or a similar error. Not sure exactly what sequence would cause that, or whether it's possible or not. Maybe if the client gets a delegation and a network partition occurs, you won't see a DELEGRETURN but a TEST_STATEID / FREE_STATEID sequence? Depends on what the server returns though - I saw NFS4ERR_BAD_STATEID for the TEST_STATEID, not NFS4ERR_EXPIRED.
nfs41_check_delegation_stateid() -> nfs41_test_and_free_expired_stateid()

~~~
2477 #if defined(CONFIG_NFS_V4_1)
2478 static int nfs41_test_and_free_expired_stateid(struct nfs_server *server,
2479                 nfs4_stateid *stateid,
2480                 struct rpc_cred *cred)
2481 {
2482         int status;
2483
2484         switch (stateid->type) {
2485         default:
2486                 break;
2487         case NFS4_INVALID_STATEID_TYPE:
2488         case NFS4_SPECIAL_STATEID_TYPE:
2489                 return -NFS4ERR_BAD_STATEID;
2490         case NFS4_REVOKED_STATEID_TYPE:
2491                 goto out_free;
2492         }
2493
2494         status = nfs41_test_stateid(server, stateid, cred);
2495         switch (status) {
2496         case -NFS4ERR_EXPIRED:
2497         case -NFS4ERR_ADMIN_REVOKED:
2498         case -NFS4ERR_DELEG_REVOKED:
2499                 break;
2500         default:
2501                 return status;
2502         }
2503 out_free:
2504         /* Ack the revoked state to the server */
2505         nfs41_free_stateid(server, stateid, cred, true);
2506         return -NFS4ERR_EXPIRED;
2507 }
~~~

> But now that I say that--that STATE_REVOKED bit doesn't tell the client that
> *something* was revoked, it doesn't necessarily know what. So the client
> may just send FREE_STATEID's for all of the state it knows about, ignoring
> any that return any kind of error?
>
> So the situation here may be that we're revoking a bunch of delegations at
> once. The cl_revoked list is already non-empty and the client has already
> started sending out random FREE_STATEID's, and one of them happens to arrive
> just as we're revoking another delegation.

Hmmm, yeah, I'll have to check that scenario and see what can generate the FREE_STATEIDs.

> That seems like a small window for a race, but maybe if you're doing a ton
> of revocations it becomes possible?
>
> Your patch would make sense anyway, good find.

Ok cool. I'll study this a bit more tomorrow, then assuming I can come up with a header that makes sense I'll send that 1-liner to the list.

Created attachment 1550559 [details]
patch to fix race on cl_revoked list
Created attachment 1550822 [details]
Patch to fix this bug against 5.1-rc3
Created attachment 1551104 [details]
one possible code flow that may lead to race condition
So far no luck reproducing this or showing that the race condition can occur. Scott and I have been talking on IRC, but so far neither of us can repro this. To my knowledge I've not even been able to see FREE_STATEID sent from the client, or even cl_revoked populated. It looks like maybe for this to occur you need:

1. The NFS server to trigger recalls of delegations, which will populate the del_recall_lru list. This is easily done by having one NFS client hold files open (i.e. tail -f) while a second NFS client, or a process on the server, either truncates or writes to the files.
2. The del_recall_lru list to remain populated for at least /proc/fs/nfsd/nfsv4leasetime, so that nfs4_laundromat runs and calls revoke_delegation(), which adds to cl_revoked.

So far #2 seems elusive, but I think Scott got this at one point.

Ok, so for #2 Scott tried a small stap on the client and it's really effective at getting a good portion of the NFS client behavior:

~~~
stap -gve 'probe module("nfsv4").function("nfs4_callback_recall") { printf("%s: delaying\n", ppfunc()); mdelay(10); }'
~~~

I ran the above on an NFS client (957.5.1.el7) plus this:

~~~
for i in $(seq 1 5000); do tail -f /mnt/nfs/test/file-$i.bin & done
~~~

Then on a second NFS client did this:

~~~
for i in $(seq 1 5000); do date >> /mnt/nfs/test/file-$i.bin & done
~~~

I didn't get the hang yet, but I think Scott may have. In a tcpdump I see a TEST_STATEID reply with NFS4ERR_DELEG_REVOKED followed immediately by FREE_STATEID, and SEQUENCE replies with SEQ4_STATUS_RECALLABLE_STATE_REVOKED set. The stream of operations clears for me though and I don't get a hang - eventually at the end I just see an idle connection with SEQUENCE sent every 10s (lease time is 15 seconds), and no SEQ4_STATUS_RECALLABLE_STATE_REVOKED bit set. The only piece missing for the race to happen is nfsd's nfs4_laundromat thread getting delayed on the cl_lock inside revoke_delegation().
This seems like it may be a stretch, but it's definitely possible to get contention on cl_lock, and the other missing pieces are there now.

(In reply to Dave Wysochanski from comment #22)
> Ok so for #2 scott tried a small stap on the client and it's really
> effective to get a good portion of the NFS client behavior:
> stap -gve 'probe module("nfsv4").function("nfs4_callback_recall") {
> printf("%s: delaying\n", ppfunc()); mdelay(10); }'

While the above is an artificially injected delay, I think it's at least possible such a delay could happen. Consider the code path below.

* nfs_delegation_find_inode_server: a loop that is O(# delegations), taking delegation->lock. We would not expect delegation->lock to be highly contended, but some contention is possible due to the state manager running through the list at the same time.
* nfs_delegation_find_inode -> nfs_delegation_find_inode_server: only called from nfs4_callback_recall (should only be called once with a Linux NFS server) and nfs4_callback_getattr (should never be called with a Linux NFS server).
* It is possible inode->i_lock is contended, though it should only be one spinlock per nfs4_callback_recall.

~~~
 71 __be32 nfs4_callback_recall(struct cb_recallargs *args, void *dummy,
 72                             struct cb_process_state *cps)
 73 {
 74         struct inode *inode;
 75         __be32 res;
 76
 77         res = htonl(NFS4ERR_OP_NOT_IN_SESSION);
 78         if (!cps->clp) /* Always set for v4.0. Set in cb_sequence for v4.1 */
 79                 goto out;
 80
 81         dprintk_rcu("NFS: RECALL callback request from %s\n",
 82                 rpc_peeraddr2str(cps->clp->cl_rpcclient, RPC_DISPLAY_ADDR));
 83
 84         res = htonl(NFS4ERR_BADHANDLE);
 85         inode = nfs_delegation_find_inode(cps->clp, &args->fh);
...
781 /**
782  * nfs_delegation_find_inode - retrieve the inode associated with a delegation
783  * @clp: client state handle
784  * @fhandle: filehandle from a delegation recall
785  *
786  * Returns pointer to inode matching "fhandle," or NULL if a matching inode
787  * cannot be found.
788  */
789 struct inode *nfs_delegation_find_inode(struct nfs_client *clp,
790                                         const struct nfs_fh *fhandle)
791 {
792         struct nfs_server *server;
793         struct inode *res = NULL;
794
795         rcu_read_lock();
796         list_for_each_entry_rcu(server, &clp->cl_superblocks, client_link) {
797                 res = nfs_delegation_find_inode_server(server, fhandle);
798                 if (res != NULL)
799                         break;
800         }
801         rcu_read_unlock();
802         return res;
803 }

761 static struct inode *
762 nfs_delegation_find_inode_server(struct nfs_server *server,
763                                  const struct nfs_fh *fhandle)
764 {
765         struct nfs_delegation *delegation;
766         struct inode *res = NULL;
767
768         list_for_each_entry_rcu(delegation, &server->delegations, super_list) {
769                 spin_lock(&delegation->lock);
770                 if (delegation->inode != NULL &&
771                     nfs_compare_fh(fhandle, &NFS_I(delegation->inode)->fh) == 0) {
772                         res = igrab(delegation->inode);
773                 }
774                 spin_unlock(&delegation->lock);
775                 if (res != NULL)
776                         break;
777         }
778         return res;
779 }

1230 struct inode *igrab(struct inode *inode)
1231 {
1232         spin_lock(&inode->i_lock);
1233         if (!(inode->i_state & (I_FREEING|I_WILL_FREE))) {
1234                 __iget(inode);
1235                 spin_unlock(&inode->i_lock);
1236         } else {
1237                 spin_unlock(&inode->i_lock);
1238                 /*
1239                  * Handle the case where s_op->clear_inode is not been
1240                  * called yet, and somebody is calling igrab
1241                  * while the inode is getting freed.
1242                  */
1243                 inode = NULL;
1244         }
1245         return inode;
1246 }
1247 EXPORT_SYMBOL(igrab);
~~~

The customer reproduced with the patch from comment #17 and gave us tcpdump and mountstats data that shows a FREE_STATEID op count of 0. So it's safe to say the customer is hitting something other than the theoretical race I described earlier.
I'm actually not sure the data the customer provided shows a loop though - under "normal operation", due to how this bit works, there could be a crap ton of TEST_STATEIDs and SEQUENCE ops with RECALLABLE_STATE_REVOKED set if there are a ton of delegations (which sounds like is the case for the customer).

Right now I'm looking at this patch as a possibility, since it's not yet backported to RHEL7 and it affects the client's CB_RECALL reply status code and whether the client will send a DELEGRETURN:

~~~
commit 6c342655022d5189c45e4f7ed0cc8048c9ad9815
Author: Trond Myklebust <trond.myklebust>
Date:   Thu Jun 7 14:22:00 2018 -0400

    NFSv4: Return NFS4ERR_DELAY when a delegation recall fails due to igrab()

    If the attempt to recall the delegation fails because the inode is in
    the process of being evicted from cache, then use NFS4ERR_DELAY to ask
    the server to retry later.

    Signed-off-by: Trond Myklebust <trond.myklebust>
~~~

I've been able to reproduce this a couple of times now. My client is running 3.10.0-957.5.1.el7.x86_64 and my server is running 4.20.16-200.fc29.x86_64. Both VMs have 4 CPUs and 8G of memory. The server has 64 nfsd threads and a 15 second lease time. On the client I'm running in one window:

~~~
# stap -gve 'probe module("nfsv4").function("nfs4_callback_recall") { printf("%s: delaying\n", ppfunc()); mdelay(10); }'
~~~

and in another window:

~~~
# for i in `seq -w 1 10000`; do sleep 2m </mnt/t/dir1/file.$i & done
~~~

(Note: I already created the files ahead of time.) As soon as the bash prompt returns on the client, I run the following on the server:

~~~
# for i in `seq -w 1 10000`; do date >/export/dir1/file.$i & done
~~~

At first the server starts doing CB_RECALLs and the client starts doing DELEGRETURNs. Then the server begins revoking delegations and the SEQ4_STATUS_RECALLABLE_STATE_REVOKED flag is set. The client starts doing TEST_STATEID and FREE_STATEID (when the response from the server is NFS4ERR_DELEG_REVOKED).
I even see the SEQ4_STATUS_RECALLABLE_STATE_REVOKED flag get cleared briefly, but it is set again a short time later. Eventually I see the last couple of CB_RECALLs returning NFS4ERR_DELAY, and finally a CB_RECALL returning NFS4ERR_SEQ_MISORDERED. After the NFS4ERR_SEQ_MISORDERED, I see no more CB_RECALLs. Eventually, after all the sleep commands finish, I see no more OPEN or TEST_STATEID ops, but the bare SEQUENCE ops to keep the lease active still have the SEQ4_STATUS_RECALLABLE_STATE_REVOKED flag set in the replies. So I took a vmcore of the server at this point.

~~~
crash> p nfsd_net_id
nfsd_net_id = $1 = 10
crash> p &init_net
$2 = (struct net *) 0xffffffffbe314d40
crash> p $2->gen
$3 = (struct net_generic *) 0xffff92853750f180
crash> p $3->ptr[10]
$4 = (void *) 0xffff928532619200
crash> p &((struct nfsd_net *)0xffff928532619200)->client_lru
$5 = (struct list_head *) 0xffff928532619288
crash> list -H 0xffff928532619288 -o nfs4_client.cl_lru
ffff9285245da958
crash> p &((struct nfs4_client *)0xffff9285245da958)->cl_revoked
$6 = (struct list_head *) 0xffff9285245da9c0
crash> list -H 0xffff9285245da9c0 -o nfs4_delegation.dl_recall_lru|wc -l
8470
~~~

That's 8470 delegations on the cl_revoked list.

~~~
crash> p &((struct nfs4_client *)0xffff9285245da958)->cl_cb_waitq
$7 = (struct rpc_wait_queue *) 0xffff9285245dad38
crash> rpc_wait_queue.qlen 0xffff9285245dad38
  qlen = 8315
~~~

8315 tasks on the cl_cb_waitq. None of them are making progress. I know this because earlier I had a live crash session running on the server and there were no changes for hours. Probably because cl_cb_slot_busy is set:

~~~
crash> nfs4_client.cl_cb_slot_busy 0xffff9285245da958
  cl_cb_slot_busy = 1
crash> px &((struct rpc_wait_queue *)0xffff9285245dad38)->tasks[0]
$8 = (struct list_head *) 0xffff9285245dad40
~~~

All of the tasks have a tk_action of rpc_prepare_task, so those haven't been transmitted yet (which makes sense, because I definitely didn't see 10,000 CB_RECALLs in wireshark).
~~~
crash> list -H 0xffff9285245dad40 -o rpc_task.u -s rpc_task.tk_action |head
ffff92851b978f00
  tk_action = 0xffffffffc06bcff0 <rpc_prepare_task>
ffff92851b978e00
  tk_action = 0xffffffffc06bcff0 <rpc_prepare_task>
ffff92851b978800
  tk_action = 0xffffffffc06bcff0 <rpc_prepare_task>
ffff92851b978a00
  tk_action = 0xffffffffc06bcff0 <rpc_prepare_task>
ffff92851b978000
  tk_action = 0xffffffffc06bcff0 <rpc_prepare_task>
~~~

I think maybe there's something wrong with how NFS4ERR_SEQ_MISORDERED is handled in nfsd4_cb_sequence_done().

I have been reviewing the list of commits upstream but not yet backported to RHEL7. This is a possibly interesting NFS client commit, given: 1) the customer's indication that the repro involves interrupting a build process, and 2) Scott's observation of SEQ_MISORDERED, which is an unusual error.

~~~
commit 3453d5708b33efe76f40eca1c0ed60923094b971
Author: Trond Myklebust <trond.myklebust>
Date:   Wed Jun 20 17:53:34 2018 -0400

    NFSv4.1: Avoid false retries when RPC calls are interrupted

    A 'false retry' in NFSv4.1 occurs when the client attempts to transmit
    a new RPC call using a slot+sequence number combination that references
    an already cached one. Currently, the Linux NFS client will do this if
    a user process interrupts an RPC call that is in progress.
    The problem with doing so is that we defeat the main mechanism used by
    the server to differentiate between a new call and a replayed one. Even
    if the server is able to perfectly cache the arguments of the old call,
    it cannot know if the client intended to replay or send a new call.
    The obvious fix is to bump the sequence number pre-emptively if an RPC
    call is interrupted, but in order to deal with the corner cases where
    the interrupted call is not actually received and processed by the
    server, we need to interpret the error NFS4ERR_SEQ_MISORDERED as a sign
    that we need to either wait or locate a correct sequence number that
    lies between the value we sent, and the last value that was acked by a
    SEQUENCE call on that slot.
~~~
    Signed-off-by: Trond Myklebust <trond.myklebust>
    Tested-by: Jason Tibbitts <tibbs.edu>

We do not have a reproducer, and the only person hitting this has said they can no longer reproduce after 3.10.0-1062.4.1.el7. We have patches that fix issues in this area of code, but as far as I understand there was a series involved, and the cost/benefit is unclear at this point.